Generating PA and CPset from a protein

 

Let us consider the single-chain protein 2igd as our example. We can easily extract the following xyz co-ordinates for the C-alpha atoms of its amino acid (AA) residues from its PDB format file.

 

AA ID   AA Residue Type   x (Å)   y (Å)   z (Å)
1       M                  1.5     3.5     5.7
2       T                  0.9     7.1     6.7
3       P                  4.0     9.0     7.5
4       A                  5.2     9.3    11.1
5       V                  4.6    12.8    12.4
6       T                  6.1    14.5    15.4
7       T                  4.6    17.4    17.5
8       Y                  7.0    20.3    17.6
9       K                  6.7    23.2    20.1
10      L                  7.5    26.9    19.7
11      V                  8.3    29.0    22.8
12      I                  7.7    32.7    22.0
13      N                  9.4    35.5    24.0
14      G                  8.2    38.4    21.9
15      K                  7.7    42.1    22.6
16      T                  3.9    41.8    22.4
17      L                  3.3    38.1    21.7
18      K                  4.3    35.6    24.4
19      G                  3.6    32.0    25.1
20      E                  3.7    28.6    23.3
21      T                  2.2    27.0    20.2
22      T                  2.6    23.7    18.4
23      T                  2.7    22.2    14.9
24      K                  2.8    18.6    13.6
25      A                  5.3    17.7    10.8
26      V                  7.3    14.9     9.3
27      D                 10.6    16.5    10.0
28      A                 12.1    19.5    11.8
29      E                 12.7    21.5     8.6
30      T                  9.0    21.5     7.8
31      A                  8.1    22.5    11.4
32      E                 10.7    25.3    11.2
33      K                  9.1    26.7     8.1
34      A                  5.7    26.7     9.7
35      F                  6.9    28.5    12.8
36      K                  8.9    31.1    10.8
37      Q                  5.8    31.8     8.8
38      Y                  3.8    32.1    12.0
39      A                  6.3    34.5    13.5
40      N                  6.4    36.6    10.3
41      D                  2.5    36.7    10.1
42      N                  2.5    38.1    13.7
43      G                  5.3    40.6    13.3
44      V                  7.9    38.8    15.3
45      D                 11.5    39.4    14.3
46      G                 13.9    37.6    16.6
47      V                 16.6    35.1    17.1
48      W                 16.0    31.4    17.1
49      T                 17.1    28.1    18.7
50      Y                 16.2    24.5    18.0
51      D                 16.7    21.6    20.4
52      D                 16.4    18.3    18.7
53      A                 16.2    16.4    22.0
54      T                 13.0    18.2    23.1
55      K                 11.6    19.0    19.6
56      T                 11.4    22.7    20.7
57      F                 12.1    25.9    18.8
58      T                 12.5    29.2    20.8
59      V                 12.3    32.7    19.4
60      T                 13.2    35.9    21.4
61      E                 13.0    39.5    20.5
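
The extraction step above can be sketched in a few lines of Python. This is a minimal sketch, not the original tool: the function name `extract_ca_atoms` and the choice to renumber residues from 1 and round to one decimal (to match the table) are assumptions made for illustration. The column positions follow the fixed-width ATOM record layout of the PDB file format.

```python
# Hypothetical helper; names are illustrative and not taken from the original work.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def extract_ca_atoms(pdb_lines):
    """Return (aa_id, residue_letter, x, y, z) for each C-alpha atom, in file order."""
    rows = []
    seen = set()
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # keep only the first model of a multi-model file
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":  # atom name, columns 13-16
            continue
        key = (line[21], line[22:27])  # chain ID + residue number + insertion code
        if key in seen:
            continue  # skip alternate locations of the same residue
        seen.add(key)
        letter = THREE_TO_ONE.get(line[17:20].strip(), "X")  # residue name, columns 18-20
        x = float(line[30:38])  # columns 31-38
        y = float(line[38:46])  # columns 39-46
        z = float(line[46:54])  # columns 47-54
        # Assumption: AA IDs are renumbered 1..n and coordinates rounded as in the table.
        rows.append((len(rows) + 1, letter, round(x, 1), round(y, 1), round(z, 1)))
    return rows
```

A typical call would be `extract_ca_atoms(open("2igd.pdb"))`, where the file name is a placeholder for wherever the PDB file is stored.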

 

 

Running STRIDE:

 

Then, we can run the STRIDE algorithm to get the following secondary structure element (SSE) information for the protein. ‘H’ means an alpha helix and ‘E’ means a beta strand (extended conformation).

 

SSE ID   SSE Type   Start AA ID   Start AA   End AA ID   End AA   Length
1        E          6             T          13          N        8
2        E          18            K          25          A        8
3        H          28            A          42          N        15
4        E          47            V          51          D        5
5        E          56            T          60          T        5

 

 

SSEs as 3D vectors:

 

Then, we can treat each SSE as a 3D vector (line segment) and calculate the start and the end points for each SSE vector using the equations by Singh and Brutlag, 1997.

 

SSE ID   Start x   Start y   Start z   End x   End y   End z
1         5.4      16.0      16.4       8.6    34.1    23.0
2         3.9      33.8      24.7       4.1    18.2    12.2
3        10.5      21.3       9.6       4.4    36.5    11.7
4        16.3      33.2      17.1      16.4    23.1    19.2
5        11.7      24.3      19.8      12.7    34.3    20.4
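
The endpoint calculation can be sketched as below. This is a sketch under an assumption: it uses the commonly cited form of the Singh and Brutlag endpoints, in which a strand endpoint is the mean of two consecutive C-alpha positions and a helix endpoint is a weighted mean of four consecutive C-alpha positions with weights 0.74, 1, 1, 0.74. That form reproduces the values in the table above to within the rounding of the input co-ordinates, but the function name and the exact weights should be checked against the original paper.

```python
def sse_endpoints(sse_type, coords):
    """Return (start_point, end_point) for one SSE.

    coords is the ordered list of (x, y, z) C-alpha positions of the SSE.
    Assumed rule: helices ('H') use a 4-residue weighted mean, strands use
    a 2-residue mean, at each end of the element.
    """
    def weighted_mean(points, weights):
        total = sum(weights)
        return tuple(
            sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(3)
        )

    if sse_type == "H":
        weights = (0.74, 1.0, 1.0, 0.74)  # assumed helix weights
        return weighted_mean(coords[:4], weights), weighted_mean(coords[-4:], weights)
    return weighted_mean(coords[:2], (1.0, 1.0)), weighted_mean(coords[-2:], (1.0, 1.0))


# SSE 4 (strand, residues 47 to 51), using the rounded co-ordinates listed earlier.
sse4 = [(16.6, 35.1, 17.1), (16.0, 31.4, 17.1), (17.1, 28.1, 18.7),
        (16.2, 24.5, 18.0), (16.7, 21.6, 20.4)]
start4, end4 = sse_endpoints("E", sse4)
```

Because the inputs here are already rounded to one decimal place, the results can differ from the tabulated endpoints in the last digit.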

 

 

Protein Abstract (PA):

 

Then, we can derive our Protein Abstract (PA) as follows.

 

|A|   |S|   SL   HL       HN    S
61    5     41   0.3659   0.2   EEHEE
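
The PA values can be reproduced from the SSE table with a short calculation. The attribute meanings used here are inferred from the numbers rather than quoted from the paper: |A| as the number of residues, |S| as the number of SSEs, SL as the total SSE length, HL as the fraction of SL contributed by helices (15 / 41), HN as the fraction of SSEs that are helices (1 / 5), and S as the string of SSE types. The function name is hypothetical.

```python
# (type, start AA ID, end AA ID) for each SSE, copied from the STRIDE table.
SSES = [("E", 6, 13), ("E", 18, 25), ("H", 28, 42), ("E", 47, 51), ("E", 56, 60)]


def protein_abstract(n_residues, sses):
    """Compute the PA attributes; definitions are inferred from the worked example."""
    lengths = [end - start + 1 for _, start, end in sses]
    total_len = sum(lengths)
    helix_len = sum(n for (kind, _, _), n in zip(sses, lengths) if kind == "H")
    helix_count = sum(1 for kind, _, _ in sses if kind == "H")
    return {
        "|A|": n_residues,
        "|S|": len(sses),
        "SL": total_len,
        "HL": round(helix_len / total_len, 4) if total_len else 0.0,
        "HN": round(helix_count / len(sses), 4) if sses else 0.0,
        "S": "".join(kind for kind, _, _ in sses),
    }


pa = protein_abstract(61, SSES)
```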

 

 

Contact Pattern (CP):

 

In order to generate the CPset, we first have to derive the Contact Patterns (CPs) from the SSEs as follows.

 

CP ID   SSE 1   SSE 2
1       1       2
2       1       3
3       1       4
4       1       5
5       2       3
6       2       4
7       2       5
8       3       4
9       3       5
10      4       5
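
In this example the CPs are simply all unordered pairs of the five SSEs, listed in lexicographic order. A minimal sketch of that enumeration, with a hypothetical function name, is:

```python
from itertools import combinations


def enumerate_cps(n_sses):
    """Return (cp_id, sse_1, sse_2) for every pair of SSE IDs with sse_1 < sse_2."""
    pairs = combinations(range(1, n_sses + 1), 2)
    return [(cp_id, i, j) for cp_id, (i, j) in enumerate(pairs, start=1)]


cps = enumerate_cps(5)
```

For n SSEs this yields n(n - 1) / 2 CPs, which gives the 10 rows listed for 2igd.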

 

 

CP Feature Vector:

 

Then, we can derive the CP feature vector attribute values for each of the 10 CPs using Equations 8 to 18 as follows.

 

CP ID   CT   AS   SS   AD   SD   W           ND        VD        MD        CD
1       3    6    1    12   1    -165.5378    3.3532    4.9446   10.4063   0.6500
2       2    6    1    22   2      56.9557    9.3186   10.0684   13.8333   0.0000
3       3    6    1    41   3    -148.5717    9.2396    9.7542   13.0750   0.0000
4       3    6    1    50   4      37.0596    4.6358    4.9040   10.1500   0.4545
5       2    18   2    10   1    -143.0683    7.5600    7.6400   13.6583   0.0429
6       3    18   2    29   2      74.0149   12.4126   14.5713   16.0500   0.0000
7       3    18   2    38   3    -133.9085    7.9641    9.8411   12.8500   0.0000
8       1    28   3    19   1    -151.1989   11.2169   11.3657   14.2800   0.0000
9       1    28   3    28   2      38.3979    9.9986   10.6034   13.2533   0.0000
10      3    47   4    9    1    -163.6763    4.6464    4.9090    8.0000   1.0000
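
The four positional attributes in this table can be cross-checked against the SSE table. The rules below are inferred from the tabulated values and are not quoted from Equations 8 to 18: AS appears to be the start AA ID of the first SSE, SS the ID of the first SSE, AD the difference between the two start AA IDs, and SD the difference between the two SSE IDs. The geometric attributes (W, ND, VD, MD, CD) depend on equations not reproduced here and are left out.

```python
# (type, start AA ID, end AA ID) for each SSE, copied from the STRIDE table.
SSES = [("E", 6, 13), ("E", 18, 25), ("H", 28, 42), ("E", 47, 51), ("E", 56, 60)]


def positional_attributes(sses, i, j):
    """Return AS, SS, AD and SD for the CP formed by SSE IDs i < j (1-based).

    The definitions are assumptions inferred from the worked example.
    """
    start_i = sses[i - 1][1]
    start_j = sses[j - 1][1]
    return {"AS": start_i, "SS": i, "AD": start_j - start_i, "SD": j - i}


cp1 = positional_attributes(SSES, 1, 2)
cp10 = positional_attributes(SSES, 4, 5)
```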

 

 

Discrete CP Feature Vector:

 

Then, we can generate the discrete CP feature vectors using the default number of bins for each attribute as shown in the Table.

The resulting bin for each attribute value is shown as a binary value.

For example, the default number of bins for the “Closest segment-segment distance (ND)” attribute is 16, and the maximum possible original value for this attribute is 64.0 (in Angstroms). So, for CP #2, the ND value 9.3186 is mapped to bin #2 using Equation 15 in the paper.

 

bin(9.3186) = floor(9.3186 * 16 / 64.0) =  2 (or 0010 in binary)
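
The same mapping can be written as a small function. This sketch covers only the case shown above, where values start at 0; the clamping of out-of-range values to the first or last bin is an assumption added for safety and is not stated in the text.

```python
import math


def bin_index(value, n_bins, max_value):
    """Map a value in [0, max_value] to an integer bin in [0, n_bins - 1]."""
    index = math.floor(value * n_bins / max_value)
    # Assumption: values outside the range are clamped to the nearest valid bin.
    return min(max(index, 0), n_bins - 1)


nd_bin = bin_index(9.3186, 16, 64.0)   # ND of CP #2
nd_bits = format(nd_bin, "04b")        # 16 bins need 4 bits
```

Here `nd_bin` is 2 and `nd_bits` is the string "0010", matching the ND entry for CP #2.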

 

CP ID   CT   AS   SS     AD   SD      W      ND     VD   MD   CD   Equivalent Integer
1       11   0    0000   00   00000   0000   0000   00   00   1    100663297
2       10   0    0000   00   00000   1010   0010   00   00   0    67114048
3       11   0    0000   00   00001   0001   0010   00   00   0    100672064
4       11   0    0000   00   00001   1001   0001   00   00   0    100676128
5       10   0    0000   00   00000   0001   0001   00   00   0    67109408
6       11   0    0000   00   00000   1011   0011   00   01   0    100669026
7       11   0    0000   00   00001   0010   0001   00   00   0    100672544
8       01   0    0000   00   00000   0001   0010   00   00   0    33555008
9       01   0    0000   00   00000   1001   0010   00   00   0    33559104
10      11   0    0001   00   00000   0000   0001   00   00   1    101711905
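
The "Equivalent Integer" column is consistent with concatenating the ten binary fields from left (CT) to right (CD) into one 27-bit number. The sketch below reproduces that; the field widths are read off the binary strings in the table rather than taken from the paper, and the names `FIELD_WIDTHS` and `pack_cp` are hypothetical.

```python
# Field order and bit widths as they appear in the table (an inferred layout).
FIELD_WIDTHS = [("CT", 2), ("AS", 1), ("SS", 4), ("AD", 2), ("SD", 5),
                ("W", 4), ("ND", 4), ("VD", 2), ("MD", 2), ("CD", 1)]


def pack_cp(bins):
    """Pack a dict of per-attribute bin numbers into a single integer."""
    value = 0
    for name, width in FIELD_WIDTHS:
        field = bins[name]
        if not 0 <= field < (1 << width):
            raise ValueError(f"{name}={field} does not fit in {width} bits")
        value = (value << width) | field  # append this field's bits on the right
    return value


# CP #2 from the table: CT=10, W=1010, ND=0010, all other fields zero.
cp2 = {"CT": 0b10, "AS": 0, "SS": 0, "AD": 0, "SD": 0,
       "W": 0b1010, "ND": 0b0010, "VD": 0, "MD": 0, "CD": 0}
cp2_int = pack_cp(cp2)
```

For CP #2 this gives 67114048, the value in the last column. Collecting the packed integers of all CPs into a set and sorting them yields the CPset.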

 

 

Discrete CP Feature Vector Set (CPset):

 

Finally, we have our CPset for protein 2igd as:

 

{ 33555008, 33559104, 67109408, 67114048, 100663297, 100669026, 100672064, 100672544, 100676128, 101711905 }