Generating PA and CPset from a protein
Let us consider the single-chain protein 2igd as our example. We can easily extract the following xyz co-ordinates of the C-alpha atoms of its AA residues from its PDB format file.
AA ID | AA Residue Type | x | y | z
1 | M | 1.5 | 3.5 | 5.7
2 | T | 0.9 | 7.1 | 6.7
3 | P | 4.0 | 9.0 | 7.5
4 | A | 5.2 | 9.3 | 11.1
5 | V | 4.6 | 12.8 | 12.4
6 | T | 6.1 | 14.5 | 15.4
7 | T | 4.6 | 17.4 | 17.5
8 | Y | 7.0 | 20.3 | 17.6
9 | K | 6.7 | 23.2 | 20.1
10 | L | 7.5 | 26.9 | 19.7
11 | V | 8.3 | 29.0 | 22.8
12 | I | 7.7 | 32.7 | 22.0
13 | N | 9.4 | 35.5 | 24.0
14 | G | 8.2 | 38.4 | 21.9
15 | K | 7.7 | 42.1 | 22.6
16 | T | 3.9 | 41.8 | 22.4
17 | L | 3.3 | 38.1 | 21.7
18 | K | 4.3 | 35.6 | 24.4
19 | G | 3.6 | 32.0 | 25.1
20 | E | 3.7 | 28.6 | 23.3
21 | T | 2.2 | 27.0 | 20.2
22 | T | 2.6 | 23.7 | 18.4
23 | T | 2.7 | 22.2 | 14.9
24 | K | 2.8 | 18.6 | 13.6
25 | A | 5.3 | 17.7 | 10.8
26 | V | 7.3 | 14.9 | 9.3
27 | D | 10.6 | 16.5 | 10.0
28 | A | 12.1 | 19.5 | 11.8
29 | E | 12.7 | 21.5 | 8.6
30 | T | 9.0 | 21.5 | 7.8
31 | A | 8.1 | 22.5 | 11.4
32 | E | 10.7 | 25.3 | 11.2
33 | K | 9.1 | 26.7 | 8.1
34 | A | 5.7 | 26.7 | 9.7
35 | F | 6.9 | 28.5 | 12.8
36 | K | 8.9 | 31.1 | 10.8
37 | Q | 5.8 | 31.8 | 8.8
38 | Y | 3.8 | 32.1 | 12.0
39 | A | 6.3 | 34.5 | 13.5
40 | N | 6.4 | 36.6 | 10.3
41 | D | 2.5 | 36.7 | 10.1
42 | N | 2.5 | 38.1 | 13.7
43 | G | 5.3 | 40.6 | 13.3
44 | V | 7.9 | 38.8 | 15.3
45 | D | 11.5 | 39.4 | 14.3
46 | G | 13.9 | 37.6 | 16.6
47 | V | 16.6 | 35.1 | 17.1
48 | W | 16.0 | 31.4 | 17.1
49 | T | 17.1 | 28.1 | 18.7
50 | Y | 16.2 | 24.5 | 18.0
51 | D | 16.7 | 21.6 | 20.4
52 | D | 16.4 | 18.3 | 18.7
53 | A | 16.2 | 16.4 | 22.0
54 | T | 13.0 | 18.2 | 23.1
55 | K | 11.6 | 19.0 | 19.6
56 | T | 11.4 | 22.7 | 20.7
57 | F | 12.1 | 25.9 | 18.8
58 | T | 12.5 | 29.2 | 20.8
59 | V | 12.3 | 32.7 | 19.4
60 | T | 13.2 | 35.9 | 21.4
61 | E | 13.0 | 39.5 | 20.5
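The extraction step can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `ca_coordinates` is hypothetical, the column slices assume the standard fixed-column PDB ATOM record layout, and the handling of alternate locations (keep blank or "A" only) is a simplifying assumption.

```python
def ca_coordinates(pdb_lines):
    """Return (residue number, residue name, x, y, z) for each C-alpha atom.

    Assumes fixed-column PDB ATOM records: atom name in columns 13-16,
    alternate location in column 17, residue name in columns 18-20,
    residue number in columns 23-26, x/y/z in columns 31-38, 39-46, 47-54.
    """
    rows = []
    for line in pdb_lines:
        # HETATM records are skipped, so calcium ions (also named "CA") are ignored.
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":
            continue
        # Assumption: keep only the first alternate location, if any are present.
        if line[16] not in (" ", "A"):
            continue
        res_name = line[17:20].strip()
        res_seq = int(line[22:26])
        x = float(line[30:38])
        y = float(line[38:46])
        z = float(line[46:54])
        rows.append((res_seq, res_name, x, y, z))
    return rows
```

Called on the lines of the PDB file (for example `ca_coordinates(open(path))`, where `path` points to a local copy of the file), it yields one tuple per residue, with three-letter residue names that still need mapping to the one-letter codes used in the table.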
Running STRIDE:
Then, we can run the STRIDE algorithm to get the following SSE information from these 3D co-ordinates. ‘H’ means an alpha helix and ‘E’ means a beta strand (extended conformation).
SSE ID | SSE Type | Start AA ID | Start AA | End AA ID | End AA | Length
1 | E | 6 | T | 13 | N | 8
2 | E | 18 | K | 25 | A | 8
3 | H | 28 | A | 42 | N | 15
4 | E | 47 | V | 51 | D | 5
5 | E | 56 | T | 60 | T | 5
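If the secondary-structure assignment is available as one letter per residue, the SSE table can be obtained by collapsing consecutive identical letters into segments. The sketch below assumes such a string; the one used here is reconstructed from the table above, with '-' as a placeholder for every residue outside an SSE, and is not actual STRIDE output. The function name `sse_segments` is hypothetical.

```python
from itertools import groupby


def sse_segments(per_residue, keep=("H", "E")):
    """Collapse a per-residue assignment string into (type, start, end, length).

    Residue numbering is assumed to start at 1 and to be contiguous.
    """
    segments = []
    position = 1
    for code, run in groupby(per_residue):
        length = len(list(run))
        if code in keep:
            segments.append((code, position, position + length - 1, length))
        position += length
    return segments


# Placeholder string rebuilt from the SSE table ('-' = not in any listed SSE).
assignment = (
    "-" * 5 + "E" * 8 + "-" * 4 + "E" * 8 + "-" * 2
    + "H" * 15 + "-" * 4 + "E" * 5 + "-" * 4 + "E" * 5 + "-"
)

for sse_id, segment in enumerate(sse_segments(assignment), start=1):
    print(sse_id, *segment)
```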
SSEs as 3D vectors:
Then, we can treat each SSE as a 3D vector (line segment) and calculate the start and the end points for each SSE vector using the equations by Singh and Brutlag, 1997.
SSE ID | Start x | Start y | Start z | End x | End y | End z
1 | 5.4 | 16.0 | 16.4 | 8.6 | 34.1 | 23.0
2 | 3.9 | 33.8 | 24.7 | 4.1 | 18.2 | 12.2
3 | 10.5 | 21.3 | 9.6 | 4.4 | 36.5 | 11.7
4 | 16.3 | 33.2 | 17.1 | 16.4 | 23.1 | 19.2
5 | 11.7 | 24.3 | 19.8 | 12.7 | 34.3 | 20.4
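The endpoint calculation can be sketched as below. The weights are an assumption recalled from Singh and Brutlag (1997) rather than taken from this text: a strand endpoint is the mean of the two terminal C-alpha positions, and a helix endpoint is a weighted mean of the four terminal C-alpha positions with weights 0.74, 1, 1, 0.74. With the one-decimal coordinates above, these weights agree with the tabulated endpoints for SSE 1 and SSE 3 to within rounding. The names `sse_vector` and `weighted_point` are hypothetical.

```python
# C-alpha coordinates from the first table, only for the residues used below.
CA = {
    6: (6.1, 14.5, 15.4), 7: (4.6, 17.4, 17.5),
    12: (7.7, 32.7, 22.0), 13: (9.4, 35.5, 24.0),
    28: (12.1, 19.5, 11.8), 29: (12.7, 21.5, 8.6),
    30: (9.0, 21.5, 7.8), 31: (8.1, 22.5, 11.4),
    39: (6.3, 34.5, 13.5), 40: (6.4, 36.6, 10.3),
    41: (2.5, 36.7, 10.1), 42: (2.5, 38.1, 13.7),
}

# Assumed weights (recalled from Singh and Brutlag, 1997; not given in this text).
HELIX_WEIGHTS = (0.74, 1.0, 1.0, 0.74)
STRAND_WEIGHTS = (1.0, 1.0)


def weighted_point(ca, residue_ids, weights):
    """Weighted mean of the C-alpha positions of the given residues."""
    total = sum(weights)
    return tuple(
        sum(w * ca[r][axis] for r, w in zip(residue_ids, weights)) / total
        for axis in range(3)
    )


def sse_vector(ca, sse_type, start, end):
    """Return (start point, end point) of the line segment representing an SSE."""
    weights = HELIX_WEIGHTS if sse_type == "H" else STRAND_WEIGHTS
    n = len(weights)
    start_point = weighted_point(ca, range(start, start + n), weights)
    end_point = weighted_point(ca, range(end - n + 1, end + 1), weights)
    return start_point, end_point


for sse_id, (sse_type, start, end) in {1: ("E", 6, 13), 3: ("H", 28, 42)}.items():
    start_point, end_point = sse_vector(CA, sse_type, start, end)
    print(sse_id, [round(v, 2) for v in start_point], [round(v, 2) for v in end_point])
```

Small differences in the last digit relative to the table are expected, because the table was presumably computed from the full-precision PDB coordinates rather than the rounded values listed here.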
Protein Abstract (PA):
Then, we can derive our Protein Abstract (PA) as follows.
|A| | |S| | SL | HL | HN | S
61 | 5 | 41 | 0.3659 | 0.2 | EEHEE
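The PA values can be reproduced from the SSE table with a few lines of code. The attribute definitions used here are inferred from the numbers, not quoted from the paper: |A| is taken as the number of residues, |S| as the number of SSEs, SL as the total SSE length, HL as the fraction of SL contributed by helices, HN as the fraction of SSEs that are helices, and S as the string of SSE types. The function name `protein_abstract` is hypothetical.

```python
# (type, start AA ID, end AA ID) for each SSE, from the STRIDE table.
SSES = [("E", 6, 13), ("E", 18, 25), ("H", 28, 42), ("E", 47, 51), ("E", 56, 60)]


def protein_abstract(n_residues, sses):
    """Compute the PA attributes under the inferred definitions stated above."""
    if not sses:
        raise ValueError("at least one SSE is required")
    lengths = [end - start + 1 for _, start, end in sses]
    total_length = sum(lengths)
    helix_length = sum(n for (kind, _, _), n in zip(sses, lengths) if kind == "H")
    helix_count = sum(1 for kind, _, _ in sses if kind == "H")
    return {
        "|A|": n_residues,
        "|S|": len(sses),
        "SL": total_length,
        "HL": round(helix_length / total_length, 4),
        "HN": round(helix_count / len(sses), 4),
        "S": "".join(kind for kind, _, _ in sses),
    }


print(protein_abstract(61, SSES))
```

For 2igd this gives SL = 8 + 8 + 15 + 5 + 5 = 41, HL = 15 / 41 and HN = 1 / 5, matching the table.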
Contact Pattern (CP):
In order to generate the CPset, we first have to derive the Contact Patterns (CPs) from the SSEs as follows.
CP ID | SSE 1 | SSE 2
1 | 1 | 2
2 | 1 | 3
3 | 1 | 4
4 | 1 | 5
5 | 2 | 3
6 | 2 | 4
7 | 2 | 5
8 | 3 | 4
9 | 3 | 5
10 | 4 | 5
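The CP table lists every unordered pair of the five SSEs, in lexicographic order, which gives 5 * 4 / 2 = 10 CPs. A minimal sketch of that enumeration (the function name `contact_patterns` is hypothetical):

```python
from itertools import combinations


def contact_patterns(n_sses):
    """Return (CP ID, SSE 1, SSE 2) for every pair of SSEs with SSE 1 < SSE 2."""
    pairs = combinations(range(1, n_sses + 1), 2)
    return [(cp_id, a, b) for cp_id, (a, b) in enumerate(pairs, start=1)]


for cp in contact_patterns(5):
    print(*cp)
```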
CP Feature Vector:
Then, we can derive the CP feature vector attribute values for each of the 10 CPs using Equations 8 to 18 as follows.
CP ID | CT | AS | SS | AD | SD | W | ND | VD | MD | CD
1 | 3 | 6 | 1 | 12 | 1 | -165.5378 | 3.3532 | 4.9446 | 10.4063 | 0.6500
2 | 2 | 6 | 1 | 22 | 2 | 56.9557 | 9.3186 | 10.0684 | 13.8333 | 0.0000
3 | 3 | 6 | 1 | 41 | 3 | -148.5717 | 9.2396 | 9.7542 | 13.0750 | 0.0000
4 | 3 | 6 | 1 | 50 | 4 | 37.0596 | 4.6358 | 4.9040 | 10.1500 | 0.4545
5 | 2 | 18 | 2 | 10 | 1 | -143.0683 | 7.5600 | 7.6400 | 13.6583 | 0.0429
6 | 3 | 18 | 2 | 29 | 2 | 74.0149 | 12.4126 | 14.5713 | 16.0500 | 0.0000
7 | 3 | 18 | 2 | 38 | 3 | -133.9085 | 7.9641 | 9.8411 | 12.8500 | 0.0000
8 | 1 | 28 | 3 | 19 | 1 | -151.1989 | 11.2169 | 11.3657 | 14.2800 | 0.0000
9 | 1 | 28 | 3 | 28 | 2 | 38.3979 | 9.9986 | 10.6034 | 13.2533 | 0.0000
10 | 3 | 47 | 4 | 9 | 1 | -163.6763 | 4.6464 | 4.9090 | 8.0000 | 1.0000
Discrete CP Feature Vector:
Then, we can generate the discrete CP feature vectors using the default number of bins for each attribute as shown in the Table.
The resulting bin for each attribute value is shown as a binary value.
For example, the default number of bins for the “Closest segment-segment distance (ND)” attribute is 16, and the maximum possible original value for this attribute is 64.0 Angstroms. So, for CP #2, the ND value 9.3186 is mapped into bin #2 using Equation 15 in the paper:
bin(9.3186) = floor(9.3186 * 16 / 64.0) = floor(2.32965) = 2 (or 0010 in binary)
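The binning in this example can be written as a small function. This is a sketch of the formula quoted above only; the clamp to the last bin for values at or above the maximum is an added assumption, and the name `to_bin` is hypothetical.

```python
import math


def to_bin(value, n_bins, max_value):
    """Map a value in [0, max_value] to a bin index in [0, n_bins - 1]."""
    index = math.floor(value * n_bins / max_value)
    # Assumption: out-of-range values are clamped rather than rejected.
    return min(max(index, 0), n_bins - 1)


nd_bin = to_bin(9.3186, n_bins=16, max_value=64.0)
print(nd_bin, format(nd_bin, "04b"))  # prints: 2 0010
```

Applied to the ND column of all ten CPs, the same call gives the bins 0, 2, 2, 1, 1, 3, 1, 2, 2, 1.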
CP ID | CT | AS | SS | AD | SD | W | ND | VD | MD | CD | Equivalent Integer
1 | 11 | 0 | 0000 | 00 | 00000 | 0000 | 0000 | 00 | 00 | 1 | 100663297
2 | 10 | 0 | 0000 | 00 | 00000 | 1010 | 0010 | 00 | 00 | 0 | 67114048
3 | 11 | 0 | 0000 | 00 | 00001 | 0001 | 0010 | 00 | 00 | 0 | 100672064
4 | 11 | 0 | 0000 | 00 | 00001 | 1001 | 0001 | 00 | 00 | 0 | 100676128
5 | 10 | 0 | 0000 | 00 | 00000 | 0001 | 0001 | 00 | 00 | 0 | 67109408
6 | 11 | 0 | 0000 | 00 | 00000 | 1011 | 0011 | 00 | 01 | 0 | 100669026
7 | 11 | 0 | 0000 | 00 | 00001 | 0010 | 0001 | 00 | 00 | 0 | 100672544
8 | 01 | 0 | 0000 | 00 | 00000 | 0001 | 0010 | 00 | 00 | 0 | 33555008
9 | 01 | 0 | 0000 | 00 | 00000 | 1001 | 0010 | 00 | 00 | 0 | 33559104
10 | 11 | 0 | 0001 | 00 | 00000 | 0000 | 0001 | 00 | 00 | 1 | 101711905
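The equivalent integer is the 27-bit number obtained by concatenating the ten binary fields from left (CT) to right (CD). The sketch below packs one row this way; the field widths are read off the lengths of the binary strings in the table rather than taken from the paper, and the names `FIELDS` and `pack_cp` are hypothetical.

```python
# (attribute, width in bits), in table order; widths inferred from the table.
FIELDS = [
    ("CT", 2), ("AS", 1), ("SS", 4), ("AD", 2), ("SD", 5),
    ("W", 4), ("ND", 4), ("VD", 2), ("MD", 2), ("CD", 1),
]


def pack_cp(bit_strings):
    """Concatenate the binary fields of one CP into a single integer."""
    if len(bit_strings) != len(FIELDS):
        raise ValueError("expected one bit string per attribute")
    value = 0
    for (name, width), bits in zip(FIELDS, bit_strings):
        if len(bits) != width:
            raise ValueError(f"{name} must be {width} bits wide")
        # Shift the accumulated value left and append this field.
        value = (value << width) | int(bits, 2)
    return value


cp1 = ["11", "0", "0000", "00", "00000", "0000", "0000", "00", "00", "1"]
print(pack_cp(cp1))  # prints: 100663297
```

For CP #1 only CT (binary 11, shifted left by 25 bits) and CD (the lowest bit) are non-zero, so the result is 3 * 2^25 + 1 = 100663297.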
Discrete CP Feature Vector Set (CPset):
Finally, we have our CPset for protein 2igd as:
{ 33555008, 33559104, 67109408, 67114048,
100663297, 100669026, 100672064, 100672544, 100676128, 101711905 }