- AVX-512

  - More registers, optional operands, opmasks.

    Potential solutions:

    * encode all the additional stuff into one new operand (see the
      sketch after this list)

      OPTION(k0, z, broadcast)

    * Add pseudo-registers

    	    k0m, k1m, k2m, ..., k7m
	    k0z, k1z, k2z, ..., k7z

      that correspond to opmasks with zeroing/masking. This assumes
      that opmasks are only used when zeroing/masking is also allowed,
      which may not be true.

    * Just add new mandatory ops:

    	     vaddps, zmm, k*, m/z, zmm, zmm, option

      and require the user to specify, even when they want the
      default.

    * Support optional operands. Is this doable? It will require some
      intelligence in variant selection. But the optional args will
      have recognizable types, so it may not be impossible.

    * It looks like the optional operands are generally considered
      decorations on another existing operand.
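
    A minimal sketch of the first option (everything packed into a single
    new operand); the OP_ tag value, the field layout and the macro name
    are all hypothetical:

        #define OP_AVX512_OPT   ((uint64_t)0x7 << 56)   /* hypothetical OP_ tag */

        /* Pack opmask register, zeroing flag and broadcast element size
         * (0, 32 or 64 bits) into one uint64_t operand.
         */
        #define AVX512_OPTION(kreg, zeroing, bcst_bits)         \
            (OP_AVX512_OPT                   |                  \
             ((uint64_t)(kreg)    << 16)     |                  \
             ((uint64_t)(zeroing) <<  8)     |                  \
             ((uint64_t)(bcst_bits)))

        /* e.g.  vaddps (a, zmm (1), AVX512_OPTION (1, 1, 32), zmm (2), zmm (3)); */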

  - New disp8 addressing scheme. Requires emit_reg_regm to deal with
    it and produce additional information to be stored in EVEX.

  - There may (or may not) be additional XMM and YMM registers, so we
    now need the ability to choose EVEX encoding on the fly. The way
    to do this is probably to introduce new optypes XMM_LO, YMM_LO,
    ZMM_LO(?), and then change XMM to be XMM_LO | XMM_HI. Ie., similar
    to the existing cases where some instructions can be encoded more
    efficiently when the register is ax, in this case some
    instructions can be encoded with VEX when the registers are
    *mm0-*mm15.

  - A problem is that some of the new instructions have the same name
    as existing ones, but a different number of arguments. For example

        VPGATHERDD zmm1 {k1}, vm32z 

    in AVX-512 vs

	VPGATHERDD xmm1, vm32x, xmm2

    in AVX2. There are also things like

        VADDPS zmm1 {k1}{z}, zmm2,  zmm3/m512/m32bcst {er} 

    vs

        VADDPS ymm1, ymm2, ymm3/m256

    A possibility is to just rename the AVX-512 instructions to use a vz prefix
    instead of v.

- Size directives

  - There are a number of cases where the op size can't be inferred
    from the instruction. For example,

    	 movzx eax, PTR(ebx)

    How many bytes should be zero-extended? In this case we have coped
    by simply adding movzx.8/16/32 variants.

    But there are other cases, like 

    	      add  PTR(ebx), imm8

    which can have four sizes too. That goes for all the ALU ops in
    fact.

    Proposed solution: Add new OP_MEM8/16/32/64 and RIP_REL8/16/32/64
    types that are designed such that they can be generated with

    	     BYTE_PTR
	     WORD_PTR
	     DWORD_PTR
	     QWORD_PTR

    macros that are appended to the INDEX/BASE/PTR/RIPREL macros (see
    the sketch below).

    - The A_MEM will include the new OP_MEM<size> so that you can give
      the size directive if you want to. And compute_op_size() will
      check that if you do, it is correct.

    - Some instructions will only allow the size-directed variants.
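
    A sketch of how the size-directive macros could compose with the
    existing memory-operand macros; OP_TYPE_MASK and the OP_MEM<size>
    values are hypothetical:

        /* Swap the generic memory tag for a sized one. */
        #define BYTE_PTR(op)    (((op) & ~OP_TYPE_MASK) | OP_MEM8)
        #define WORD_PTR(op)    (((op) & ~OP_TYPE_MASK) | OP_MEM16)
        #define DWORD_PTR(op)   (((op) & ~OP_TYPE_MASK) | OP_MEM32)
        #define QWORD_PTR(op)   (((op) & ~OP_TYPE_MASK) | OP_MEM64)

        /* usage:  add (a, BYTE_PTR (PTR (ebx)), imm (1)); */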
    
- The pextrw instruction got a new variant in SSE 4.1 where the
  destination can now be either a memory location or a register. In
  earlier versions it could only be a register. That means this code

         pextrw eax, xmm1, IMM (7)

should be encoded with the earlier version in preference to the newer
one.

- Consider changing labels to be integers instead of strings. This
  will make it possible in many cases for code to be stored in static
  const arrays.
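
  As a rough illustration, with integer labels the ops could live in a
  static initializer, assuming hypothetical LABEL()/JNZ() macros that
  expand to constant uint64_t values:

      enum { LBL_LOOP, LBL_DONE };

      static const uint64_t loop_body[] =
      {
          LABEL (LBL_LOOP),
          /* ... instructions encoded as uint64_t ops ... */
          JNZ   (LBL_LOOP),
          LABEL (LBL_DONE),
      };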

- Storing the encoding in the info field. Maybe rename info to
  encoding or details.


Feature checking

A block of assembly should have a declared set of features that will
be used. Then:

   - the assembler will check that unsupported instructions are not
     inadvertently used

   - at runtime, if the feature set is not supported by the CPU, it
     will report the error gracefully instead of SIGILL.

   - For code that is doing its own checking, escape hatches will be
     necessary.

A potential issue with this is that we may want to specialize at the
individual instruction level rather than at the full assembly block
level.
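
A minimal sketch of the block-level declaration, assuming hypothetical
asm_begin_features()/cpu_features() helpers and a features field on the
emitted code object:

    #define FEATURE_SSE2    (1 << 0)
    #define FEATURE_SSSE3   (1 << 1)

    asm_begin_features (a, FEATURE_SSE2 | FEATURE_SSSE3);
    /* instructions outside the declared set are rejected at assembly time */
    code = asm_emit (a);

    /* at run time */
    if ((cpu_features () & code->features) != code->features)
        return NULL;    /* report gracefully instead of taking SIGILL */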


Test suite:
- Register allocation braindump:
  Register allocator has these methods:
  - begin_spill_context (reg1, reg2, ...)
  - end_spill_context (reg1, reg2, ...)

  when you enter a spill context, all registers not mentioned in the
  arguments become eligible for spilling. When new registers are allocated,
  an eligible one may be spilled.

  When leaving a spill context, all registers allocated in the
  meantime (except those mentioned in the argument) are deallocated, and
  the registers that were spilled are moved back in from memory.

  If there are conflicts between registers that are preserved and
  those that are spilled, the preserved ones will be given new
  locations.

  Everything can fail, and if it does, the whole thing is aborted.
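
  A sketch of the API (types and names are hypothetical; every call can
  fail, in which case the whole code generation is aborted):

      /* Registers not listed become eligible for spilling. */
      if (!ra_begin_spill_context (ra, 2, reg_src, reg_dest))
          goto abort;

      /* Allocations inside the context may spill an eligible register. */
      tmp = ra_alloc_register (ra);

      /* Deallocate context-local registers and reload the spilled ones;
       * preserved registers that conflict get new locations.
       */
      if (!ra_end_spill_context (ra, 2, reg_src, reg_dest))
          goto abort;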

- Generate intermediate array of code with temp variables,
  then run register allocator on that?
  - Ie., an array of { char *mnemonic, uint64_t ops[4] }
  - Extend instruction table to contain information about what is written
    and read, and what is clobbered (see the sketch after this list).
  - Would also allow primitive optimizations such as mov-to-the-same-register
  - Also would allow the possibility of automatically generating 2-operand
    instructions from 3-operand input to deal with both AVX and non-AVX.
  - Also instruction scheduling
  - Some of this could be portable between architectures.
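
  The array element and the extended table entry mentioned above might
  look roughly like this (names invented):

      typedef struct
      {
          const char *mnemonic;
          uint64_t    ops[4];
      } ir_insn_t;

      typedef struct
      {
          uint8_t  reads;      /* bitmask: which of ops[] are read */
          uint8_t  writes;     /* bitmask: which of ops[] are written */
          uint32_t clobbers;   /* hard registers clobbered as a side effect */
      } insn_info_t;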


Alternatively, we could make the ops 32 bit, and then pass immediates
and labels as multiple ops. An issue with this is that RIP_REL can be
used as a memory reference, which would mean memory references can be
both one and two ops.

If labels were just numbers, this problem would be easy enough. A sick
hack would be to just compute a hash, and then have a debug mode where
if two labels actually collide an error is printed. The problem, though,
is that memory indices also need 64 bits.

Even with 64 bit ops, there might be architectures where we can't rely
on the pointers having room for the OP_ tag.

Probably strings and immediates should just be interned to 24 bit
numbers. They could just be stored in a table in the
assembler.
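
A sketch of the interning, assuming a small per-assembler table (the
caller keeps it below 2^24 entries so the index fits in 24 bits):

    static uint32_t
    intern (assembler_t *a, uint64_t value)
    {
        uint32_t i;

        /* A linear scan is fine for the handful of constants per block. */
        for (i = 0; i < a->n_interned; i++)
        {
            if (a->interned[i] == value)
                return i;
        }

        a->interned[a->n_interned] = value;
        return a->n_interned++;
    }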

done:

  -=-=-=-=-=-=-=-

- CRC32 checksum of the generated machine code

Moving to array of uint64 instead of varargs
- Avoids the issue of enums and 64 bit
- Can still be formatted as if it's assembly language
- If we also change labels to be integers, a lot of the time code can
  be statically generated and stored in static const arrays.
- opens the door to more complex code analysis, perhaps more advanced
  register allocation.

Alignment and values
- Sketch of the algorithm is described in a FIXME
- Annotations should have a beginning and end
- An annotation can just be 'code'
- At emit time, the last annotation is closed
- Every instruction in principle results in an annotation,
  but if both this and the previous annotation were just 'code',
  they are compressed.
- This means we can now patch up the code by simply walking the
  list of annotations.
  - maintain displacement
  - switch (annotation type)
    { code: just copy
      jump: expand if necessary, add to displacement
      align: has associated 'skip'. Adjust this skip such that the code
             will align.
             and change displacement accordingly
      label: do nothing
    }
    add displacement to both begin and end.
  - do this until nothing changes.
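
  A sketch of that patch-up loop; annotation_t and the helper functions
  are hypothetical:

      pixman_bool_t changed;

      do
      {
          annotation_t *anno;
          int disp = 0;

          changed = FALSE;

          for (anno = list; anno; anno = anno->next)
          {
              int delta = 0;

              switch (anno->type)
              {
              case ANNO_CODE:                        /* just copied */
              case ANNO_LABEL:                       /* nothing to do */
                  break;
              case ANNO_JUMP:
                  if (jump_needs_expansion (anno, disp))
                      delta = expand_jump (anno);    /* 8 bit -> 32 bit */
                  break;
              case ANNO_ALIGN:
                  delta = adjust_skip (anno, disp);  /* change in the skip */
                  break;
              }

              if (delta != 0)
                  changed = TRUE;

              anno->begin += disp;
              disp        += delta;
              anno->end   += disp;
          }
      } while (changed);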

- The stuff about values below is still good. If the user adds a value
  we have seen before, we should generally reuse it. However, if we
  haven't, it's probably better to keep it close to the actual code
  than storing it somewhere else. Maybe. Or maybe not.

- The assembler right now supports alignment, but it's a pain to keep
  this working when jumps can be converted to 8 bits.
- Also there is no support for reusing the same values in more than
  one generated fast path
- So the assembler should just support ".var" pseudoinstructions that 
  will be used to define values:

       	  .var32  "asdf"   0x00000000ffffffff0f0f0f0f0f000000000000000ffffffff0f0f0f0f0f000000
       	  .var16  "name1"  0xff00ff00ff00ff00ff00ff00ff00ff00
	  .var8   "name2"  0xffff0000ffff0000
	  .var4   "name3"  0xdeadbeef

  and so on. These are made available as labels, but there is no
  guarantee that they will be stored anywhere near where they are
  defined. (Except that they won't be more than 2GB away so that they
  can be used with RIP relative addresses).

  Note that it has to be done this way. Just putting the values in
  some table in pixman itself won't necessarily ensure that the values
  are close enough for RIP relative addressing.

  The assembler will make sure only one copy of each value is stored
  per file in the code manager. The code and the data should probably
  grow from opposite ends of the file.



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


New deal:

* Code is generated in single-lane intermediate language with typed 
  variables. Available types:

  	     un8
	     un8x4
	     pointer

* Each backend must decide how to represent those types

* Each backend will vectorize the types as best it can

* There is a "loop" instruction that takes a variable and a bound


For 8888x8x8888, the code looks like this:

    loop ("outer", h, height);

    pointer src_row = src + h * stride;
    pointer mask_row = mask + h * stride;
    pointer dest_row = dest + h * stride;

    loop ("inner", w, width);

    read_u8 (m, mask_row, 1, w);

    read_4_x_u8 (s, src_row, 4, w);

    broadcast4 (mmmm, m);

    read_4_x_u8 (d, dest_row, 4, w);

    mul (s, s, mmmm);

    shuffle (sa, s, SHUFFLE_AAAA);

    xor (sa, sa, 0xff);

    mul_add (d, sa, s);

    write (d, dest_row);


AVX (and SSE3) has palignr - maybe add something like that to the
intermediate format and emulate it on older instruction sets. Using it
requires an extra register per unaligned input though.

Useful optimizations:

	- Peephole optimizations to eliminate redundant shuffles etc.

	- Dead code elimination - in some cases we will likely end up
	  computing stuff that is not used

	- Move-from-dead-register. Basically, 

		x = y

	  where y is dead should be eliminated. We are going to generate
	  a number of these.

	- Constant propagation could make generation of intermediate
	  code simpler

	- Invariant code motion. Solids could then be generated in the
	  loop itself, rather than being special cased.

Component alpha:

Normal and component alpha can be treated largely the same way by
having the combiner function take (src, alpha, dest), and generating
alpha differently in the two cases.

if (component_alpha)
{
	alpha = src_alpha * mask;
	src = src * mask;
}
else
{
	alpha = src_alpha x 4;
	src = src;
}


The vector size for an operation is determined by

    - the intermediate format
      	  a8
	  a8r8g8b8
	  a16r16g16b16
	  a16?

      and later on 

      	  a32f r32f g32f b32f
 
    - whether multiplications are involved, in which case we may need
      twice the room in the vector.

    - how many pixels at a time

We will never be able to support everything on everything - some
things will just have to fall back to interpretation. Floating point
on an MMX-only CPU for example. (I am not generating x87 code).

It is also not clear that we should care deeply about ARM < v7. If the
generated code is not super optimal, big deal.

So basically, a backend must be allowed to bail out. If it does, the
intermediate code will be interpreted instead.

Another case where bail out may be required is MMX + 16 bit
intermediate.  But an option here is for the back end to allocate two
mmx registers as one 16 byte register.

It is tempting to load the mask as a8a8a8a8, but it is probably better
to keep it simple and load it as a8x8x8x8, shuffle as necessary, then
peephole optimize later.

	 For example

	     m = m << 24
	     m = shuffle (m, 3, 3, 3, 3);

	 can be peepholed into

	     m = shuffle (m, 0, 0, 0, 0);

The required vector size is 

    MAX (n_pixels * intermediate_size, 2 * intermediate_size)

The 2 * intermediate_size comes from the fact that the frontend won't
deal with shuffling across different registers. Ie., you can't put a 4
channel pixel into two separate registers. The backend will have to
fake a big enough register if it doesn't want to bail.

So while backends should report their preferred vector size to allow
the front end to determine a good number of pixels per operation, they
*will* be faced with 16 byte types, including 4 * float and 4 * u32.

The max number of pixels is determined by

      if (pref_vs < 2 * intermediate_size)
      	 n_pixels = 1;
      else
         n_pixels = pref_vs / intermediate_size

pref_vs has to be a multiple of 4.

And the vector size that we load stuff into is given by

    vector_size = MAX (n_pixels * intermediate_size, 2 * intermediate_size)

(And the "2" is only if the op/intermediate combination requires 
 extra space for multiplication).

Ie. the vector type to be used is

    (vector_size / intermediate_size, intermediate_type).

The code generated is then

    while (w)
    {
	while (w >= 1 && !aligned_to_two_times_intermediate (dest))
	{
	      <code (n_pixels=1, vector_size, intermediate_format, 
	      	    op, src, mask, dest);>

	      w--;
	}

	while (w >= 2 && !aligned_to_four_times_intermediate (dest))
	{
              <code (n_pixels=2, vector_size, intermediate_format, 
	      	    op, src, mask, dest);>
	      
	      w -= 2;
	}
	...
	while (w >= n_pixels)
	{
	      code (n_pixels, vector_size, intermediate_format, 
	      	    op, src, mask, dest);
		
	      w -= n_pixels;
	}
    }

ie.

[while]

for (n_pixels = 1; n_pixels <= max_pixels; n_pixels *= 2)
{
	[while]

	[end while]
}

[end while]

Here is code to compute the various properties:

static void
compute_stuff (int op,
	       pixman_format_code_t src,
	       pixman_format_code_t mask,
	       pixman_format_code_t dest)
{
    const char *intermediate;
    int pref_vs;
    int intermediate_size;
    int mult;
    int n_pixels;

    if (op == PIXMAN_OP_ADD && mask == PIXMAN_null &&
	src == PIXMAN_a8 && dest == PIXMAN_a8)
    {
	intermediate = "a8";
	intermediate_size = 1;
    }
    else if (PIXMAN_FORMAT_16BPC (src) ||
	     PIXMAN_FORMAT_16BPC (mask) ||
	     PIXMAN_FORMAT_16BPC (dest))
    {
	intermediate = "a16r16g16b16";
	intermediate_size = 8;
    }
    else
    {
	intermediate = "a8r8g8b8";
	intermediate_size = 4;
    }

    if (strcmp (intermediate, "af8rf8gf8bf8") == 0)
    {
	mult = 1;
    }
    else
    {
	mult = 2;

	if (mask == PIXMAN_null)
	{
	    if ((op == PIXMAN_OP_ADD)				||
		(op == PIXMAN_OP_OVER && PIXMAN_A (src) == 0)	||
		(op == PIXMAN_OP_SRC))
	    {
		mult = 1;
	    }
	}
    }

    for (pref_vs = 4; pref_vs <= 16; pref_vs *= 2)
    {
	int vector_size;
	
	if (pref_vs < mult * intermediate_size)
	    n_pixels = 1;
	else
	    n_pixels = pref_vs / intermediate_size;

	vector_size = max (n_pixels * intermediate_size,
			   mult * intermediate_size);
	
	printf ("%2d vector type: <%d x %s> (%d pixels)\n",
		pref_vs,
		vector_size / intermediate_size,
		intermediate,
		n_pixels);
    }
}


Larrabee:

Larrabee does not fit this pattern because it can't do arithmetic on 8
bit channels; it always expands to at least 32 bits. On the other
hand, it has 32 registers, so we could simply fake the un8 type with
extra registers.

A better approach is to have the backend give the number of pixels it
can deal with for a given type. The only problem with that is that the
backend will have to artificially limit the number of pixels per
register since it has to be able to do multiplications. Hmm, on the
other hand, the front end will need to write out registers to memory,
so the packed representation is necessary. 

But maybe only for output purposes? If the intermediate format has
Larrabee-like instructions, maybe the backends can optimize
themselves?

We do need the typed vectors so that Larrabee can use the generous
representation internally for things like un8. (Ie., we can't just use
16 bit integers there and do the shift tricks in the frontend, because
Larrabee would prefer to do things differently).

For un8 types, maybe there should actually be two types: un8 and un8m,
where multiplications are only allowed on the 'm' variation. Another
possibility is to forget about the multiplication optimization and
always reserve enough space. For the vast majority of operations, we
do need multiplications, and for those where we don't we could just
add special fast paths. 


This seems like the correct approach:

	- vector variables are typed; for example they could be un8 or
	  u16, or f16, f64 etc.

	- conceptually, and as far as the frontend is concerned, the
	  vectors are fully packed, so if you write them to memory,
	  they will be stored in a packed format. Ie., the front end will
	  consider a vector of 16 un8s to be 16 bytes wide.

	- Internally, the backend may choose a different
	  representation (and almost all backends will need to use 16
	  bits for the un8 type).

	- When vectors are written to memory, the backends will need
	  to do whatever packing and unpacking they need to undo their
	  internal representation.

	  The SSE2 backend would want to optimize two memory stores
	  into one pack + move, rather than two
	  pack-with-0-then-store.

One issue with this is what to do about several multiplications in a
row. These can be optimized because there is no need to divide by
255 until all the adds have taken place. Of course, if a backend
wishes to, it may consider the internal type "un88", even when the
external type is un8. That just means that in addition to packing, it
first has to divide by 255. And to do this it has to prove that the
intermediate results cannot overflow.
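
For reference, the usual rounded un8 multiply in C is

    uint32_t t = (uint32_t)a * b + 0x80;
    uint8_t  r = (t + (t >> 8)) >> 8;   /* = round (a * b / 255.0) for 8 bit a, b */

and deferring the division just means accumulating several products
into t before applying the last line, which is where the overflow
proof comes in.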

Another issue is what to do with formats like 565 or a1 or
weirder. And is it possible to multiply two vectors together with
different types? If it is, what is the resulting type? The answer is
likely that you can't multiply vectors with different types, so the
front end will need to decide on an intermediate format, and generate
code to load it. This means we will need to be able to convert from n
of one type to n of another.

r5g6b5 should probably be treated specially because the channels have
different widths. Maybe we just need a convert-r5g6b5-to-8888
instruction. With typed vectors, it is otherwise quite difficult to
generate conversion code for it.

What about for a8? 

	- find out how many un8s we can deal with

	- divide by 4

	- load that many u8s

	- convert to u32
	
	- shift/mask/etc.

Algorithm:

	- Determine intermediate format (max of all the involved
	  formats, though at least un8).

	- Given intermediate format, determine n_pixels for source,
	  mask and destination. Take minimum.

	- src = load (4 * n_pixel, format)
	  mask = load (4 * n_pixel, format)
	  dest = load (4 * n_pixel, format)

It is the backend's problem how it wants to represent those
internally. For example if there is no mask, and we are compositing
a8r8g8b8 x a8r8g8b8, then the SSE2 backend might report that it can
deal with 4 pixels. Internally, one 4 pixel IR register would be
stored in 2 xmm registers, unpacked. (Problem: a simple-minded
liveness analysis will not know that the two xmm registers could die
separately). Though there could be a "split" pass that would split
such variables into two.  Or there could be IR instructions that would
load two registers. Though, how would the frontend know to do that?
And how would the split pass deal with loads?

Intermediate format is probably always given as "un16" or "f32" or
somesuch (as opposed to a16r16g16b16 etc.)

All the backends need to do then is to provide a
backend_get_n_pixels() that returns the number of pixels it can handle
for a given type. Or the number of lanes, really.
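
For the SSE2 backend that hook might look roughly like this (the type
names are made up):

    static int
    sse2_get_n_pixels (ir_type_t type)
    {
        switch (type)
        {
        case IR_TYPE_un8:       /* held internally as un16 */
        case IR_TYPE_un16:
            return 8;           /* one 128 bit register, 16 bit lanes */
        case IR_TYPE_f32:
            return 4;
        default:
            return 0;           /* can't handle it: bail out */
        }
    }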

It is likely to be useful to allow backends to do their own rewriting
before register allocation. For example, the arm backend will probably
want to replace a shuffle-multiply with something that it can use to
generate normal multiplication code with.

Similarly, the mmx backend can only do word-level shuffling, so it will
want to replace

     	shuffle (<byte vector>)
	expand  ()
	
with

	expand
	shuffle (<word vector>)

These types of things should be done on the dependency graph of the
intermediate code. (And should be therefore restricted to things that
are semantically *identical*, not just equivalent)

        t = shuffle_byte (s, ...);
	v = expand (t)

can be transformed into 

        t = expand (s)
	v = shuffle (t)

but only if there are no other dependencies on t.

Alternatively, it could be turned into

	       t = shuffle_byte (s, ...);
	       tt = expand (s);
	       v = shuffle (tt);

with the hope that the initial shuffle_byte would then be dead. The first 
one seems simpler:

    	  - no termination arguments necessary

	  - no dead code elimination

	  - potentially less powerful, but meh.

Intermediate format

Benefits: 

	- Can be interpreted. Eventually only the jit code should be
	  necessary

	- Multiple backends with different size registers

	- Register allocation

	- Optimization?


- Typed variables. Should there be a 0.8 type? The intel (and other)
  backend can do the expanding itself, and ARM can do more efficient
  multiplication.

intel:
	src, mask loaded
	nmask = xor (mask, 0xff)
	s = src * mask
	    	  expand (src) -> src1, src2
		  expand (mask) -> mask1, mask2
		  pixmul
		  pixmul
		  pack -> s
	d = dest * nmask
	    	  expand (dest) -> ...
		  pack -> d
	nmask = shuffle (src, 
	s = d (+) s
	store s in dest

which is just as good as before. In fact it's a little better because
we save a negation instruction. Actually, we should keep the computed
alpha around, and then negate afterwards, and then multiply. So there
will be extra shuffling going on with this scheme.

Arm
	src, mask loaded
	nmask = xor (mask, 0xff)
	s = src * mask:
	         uxtb16 s_low, src
		 uxtb16 s_hi, src, ror 8
		 
		 mla s_low, s_low, mask, 0x80
		 mla s_hi, s_hi, mask, 0x80
		 uxtab16 s_low, s_low, s_low, rot 8
		 uxtab16 s_hi, s_hi, s_hi, rot 8

		 and s_hi, 0xff00ff00
		 uxtab16 s_low, s_hi, s_low, rot 8


One difference is that Arm wants the mask as a single byte whereas
intel wants it duplicated. The correct solution may be to simply read
it as an ARGB, then the intel can shuffle and the Arm can do its own
multiply with its builtin shifting.

Another possibility is to add an IR instruction to multiply a vector
with one number. On intel this would result in a pshuf + multiply. On
ARM it could be done with just two multiplications. On intel we would
actually need to multiply with a number given in a field of another
vector.

Unfortunately, this can't be implemented on ARM without assuming that
there is room for overflow.

An intelligent instruction selector on ARM would probably be a big
benefit.

Or just assume that NEON is the future. It seems saner than v6 SIMD.


- Three-register instructions

It is what future x86s will have, it is what ARM wants, it is much
easier to generate code for, and it is easy to turn three-register
code into two-register code, but not the other way around.


Older notes:

- The generated ops should have a simpler prototype than the normal one. Maybe
  something like.

  void (* CompositeOp) (uint32_t *src_start,
			uint32_t src_skip,	/* = src_stride - (width * src_bpp) / 8 */

       	  	       	uint32_t *mask_start,
			uint32_t mask_skip,

			uint32_t *dest_start,
			uint32_t dest_skip,

			uint16_t width,
			uint16_t height);
			
  The amount of setup in generated code should be minimized, both to simplify 
  code generation and to reduce the memory overhead of the code generation.

  For ops where source or mask is solid, src/mask_start should point to an 8888
  pixel arranged similarly to the dest format. Ie., unpacking should
  happen before the op is called.

  If we add transformations and filters, they can be added at the end of the
  argument list - that way the code won't have to change too much.

  The problem with this though is that we need essentially random access to
  do transformations. It also doesn't work for gradients.

- It should be figured out how to deal with alignment restrictions.
  There is movdqa, which requires 16 byte alignment, and there is
  movdqu, which doesn't. Agner says movdqu is slow and may better be
  replaced with two movq's. Maybe it's best to just always use correctly
  aligned reads:

  	  - Generate a version for every combination of alignments?

	  - Dynamically determine the instruction to use?
	       - This is not necessarily that slow since branch prediction will
	         usually get it right.
	       	

  In all cases we will need to be able to handle single pixels. So maybe start out
  just generating those. The optimal thing to do would be:

       	- compute 

	  	  n_pixels(sse, op, src_format, mask_format, dest_format)
	    
 	  the number of pixels that fit in a register.

	- For each line compute the initial unaligned strip of pixels. If this is
	  the same for all three, then set sw to that number, otherwise set
	  it to the full width.

	  inner_loop:

	  tmp_wid = width;	with tmp_wid on the stack.

	  w = that_number;
	  tmp_wid -= sw;

	- while (sw--)
	  {
		do_one_pixel;
	  }

	  Then compute the number of aligned pixels that can be done, and

	  while (aw--)
	  {
		do_full_size()
	  }

	  Finally

	  while (w--)
	  {
		do_one_pixel;
	  }

	  the "do_one_pixel" loop can be in a separate function to avoid duplicating
	  the code. In fact, both one and full size could be the same code with
	  a parameter to determine the difference. Branch prediction would make
	  this relatively inexpensive.

	  Although, the full size code is quite different since it has
	  to deal with packed pixels.

  Also we need the images to be identically aligned - it's not good
  enough to only align, say, the source to 16 bytes. It may be worth
  having an initial pass that deals with any leading and trailing
  unaligned data.

  Dealing with alignment:
  
	- Size of operation:	can be 1 or 2 bytes 
	       (depending on whether it's multiplication or just addition)

	- available_channels = vector_size / op_size;

  In principle we could always read in as many pixels as possible from the
  shortest input. Ie., if we have an a8 mask, composite 16 pixels at a time,
  unrolling as necessary. The register pressure is likely to make this a loss
  though.

	1 register set to 0

	1 register to hold the 16 mask pixels
	1 register to hold the 4 src pixels
	1 register to hold the 4 dest pixels

	1 register to hold two expanded src pixels
	1 register to hold two expanded dest pixels
	1 register to hold two expanded alpha pixels

  that's 8 registers already, which is all we have got unless we are in 64 bit mode.
  We could make use of mmx registers too - the currently inactive pixels could be
  held there. Ie.,

       movdqa *src, xmm0
       movdq2q xmm0, mm0
       psrldq xmm0, 8
       movdq2q xmm0, mm1

       movdqa *mask, xmm0
       movdq2q xmm0, mm2
       psrldq xmm0, 8
       movdq2q xmm0, mm3

       movdqa *dest, xmm0
       movdq2q xmm0, mm4
       psrldq xmm0, 8
       movdq2q xmm0, mm5

  A simple, but probably pretty good scheme:

  - Generate two versions of each op, one where everything is aligned,
    and one where alignment is detected on the fly.

    In both cases n_pixels is computed, the number of pixels to handle
    per iteration. In the aligned case, we then just read in that many
    pixels as efficiently as possible. For the first iteration, that
    probably means 2 pixels in many cases, but eventually it would be
    nice to unroll once to get to four pixels.

    So both versions have a preamble that reads the source, mask and
    destinations into sse registers. Then afterwards, the computations
    are the same, then finally two different versions generate the
    final write to the destination.

  Another possibility:

  - Deal with unaligned or short lines in separate loops before and after the
    aligned loop. Could be in a function.

  Current thinking:

  - Just ensure destination is aligned, and use movdqu for sources.


  Computing the rendering:
 
  - Reserve three registers  for 0x00..  0x0080.. and 0x00ff

  - Compute (src IN mask) and (srca) where (src IN mask) is returned as a
    packed vector, and srca1 and srca2 are expanded. (srca is only computed
    if actually needed). 

  - Read dest into one of the remaining two registers 

  - In the last register expand low part of d and combine it onto src1

  - In the last register expand high part of d and combine it onto src2

  At this point src1 and src2 represents the 'fd' that will get added onto s

  - Pack src1 and src2 together in src1 (which is now the fd we will add to s)

  - Unpack low part of s into src2 (mov s, s2; unpack z, s2);

  - Unpack low part of d into last register 

  - Shuffle last register to become dest alpha

  - Combine onto src2

  - Unpack high part of s into last register

  - Unpack high part of d into itself

  - Combine onto s

  - Pack s and src2 into src2

  - Add src1 and src2

  - Store the result.

Unfortunately, we are one register short of being able to do F_a's and
F_b's where both src_a and dst_a are involved.

Note: there is no integer division in SSE, so this would possibly have
to be done with floating point. Ie., using cvtdq2ps.

- It is important that the register allocator is not too dumb

     - EAX is the only register we can use for multiplication
     - There are not a lot of registers.
     - Need to do something sensible when we run out of registers.
       Spilling to memory is a possibility - just giving up is another.
  
  We need the ability to unallocate a register, and then use it in a
  following instruction. That way a register can be reused in the
  middle of an instruction.  (moves from a register to itself should
  be culled).

  This probably also implies that we need the ability to ref count
  registers, and that callees will often have to take ownership of a
  passed-in register.
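
  A possible shape for that, with hypothetical reg_ref()/reg_unref()
  helpers:

      /* the callee takes ownership of the passed-in register */
      reg_t tmp = reg_ref (ra, src_reg);

      emit_add (a, tmp, tmp);

      /* dropping the last reference frees the register for reuse by the
       * following instruction */
      reg_unref (ra, tmp);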

  Pixels are always 1, 2, 4, or 8 bytes wide, which means we can make
  good use of the x86 addressing mode: Only one register is necessary
  for the X coordinate, we can then index into images using
  displacements and shifts:

       while (h--)
       {
	     s_line = src_pixels + h * s_stride;
	     m_line = mask_pixels + h * m_stride;
	     d_line = dest_pixels + h * d_stride;

       	     while (w--)
	     {
	         s = s_line + w * s_bpp;
		 m = m_line + w * m_bpp;
		 d = d_line + w * d_bpp;
	     }
       }

  Registers:   h, w, s_line, m_line, d_line,               

  Or:

	    s = src_pixels;
	    m = mask_pixels;
	    d = dest_pixels;

	    while (h--)
	    {
		while (w--)
		{
		    [s + w * bpp]
		    [d + w * bpp]
		    [m + w * bpp]
		}

		s += s_stride;
		m += m_stride;
		d += d_stride;
	    }

  Registers: h, w,  s, d, m, s_stride, m_stride, d_stride.


- Optimistic Register Allocator

  Simply hand out registers as code asks for them. If we run out of registers,
  then spill the register somewhere on the stack. (At emit time we will make room
  for spills).

  When a spilled variable is being used, some random other variable is
  stored on the stack, and the used variable gets the
  register. Possibly using the LRU algorithm to decide the
  eviction. 

  Before jumps and before labels, a 'normalization' is run where
  things are put into the locations they are supposed to be
  in. Ie,. all variables have two fields: where *are* they, and where
  do they *belong*. The invariant is that at entry to any basic block,
  variables are in their assigned location.

  Combined with not register allocating constants, this really should
  be good enough.
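
  A sketch of the bookkeeping that makes the normalization possible
  (field names invented):

      typedef struct
      {
          int           home;       /* where the variable belongs: reg or stack slot */
          int           location;   /* where it actually is right now */
          pixman_bool_t spilled;    /* location is a stack slot */
          int           last_use;   /* for LRU eviction */
      } var_t;

      /* Before a jump or a label: for every live variable, emit whatever
       * moves are needed to make location == home again.
       */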

- Code generator / runtime assembler

  Eventually, the x86_codegen.h file should be folded into
  codex86.c. Things like membase emission can be done straight from the
  ops. 

  - Need to check malloc() returns.

  - An SSE/MMX register allocator is probably required. It would be
    nice if we could figure out up front how many registers we are
    going to need so that constants can be register allocated and 
    hoisted out of the loop when possible.

    Have a "dry run" mode that just finds out which registers are
    used?

  - Should probably swap arguments to memindex and membase to make
    them look more like at&t syntax. Also find out where to put
    immediate operators. Maybe I need to face the fact that at&t
    syntax is crack.

  - The code could be made smaller and faster if op_t were made
    smaller by compressing fields into uint8_t's. This would mean gdb
    could no longer show the register fields by their enum names, though.

  - Decide on syntax: AT&T vs Intel? Right now we are mostly at&t, but
    that's kind of weird for things like test, and cmp.

- Public API:

  pixman-sse-jit.h:

  pixman_sse_jit_t   *pixman_sse_jit_new ();
  pixman_sse_code_t  *pixman_sse_jit_get_op (pixman_sse_jit_t			*jit,
					     pixman_op_t			 op,
  		     			     const pixman_sse_image_info	*src,
					     const pixman_sse_image_info	*mask,
					     const pixman_sse_image_info	*dest);
  void                pixman_sse_code_run   (pixman_sse_code_t			*code,
					     pixman_op_t			 op,
					     pixman_image_t			*src,
					     pixman_image_t			*mask,
					     pixman_image_t			*dest,
					     int				 x_src,
					     int				 y_src,
					     int				 x_mask,
					     int				 y_mask,
					     int				 x_dest,
					     int				 y_dest,
					     uint16_t				 width,
					     uint16_t				 height);

  where pixman_sse_image_info would contain information about the format, whether the
  image is solid, whether the image is there at all (for the mask).
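
  Hypothetical usage of that API (the image_info initializers guess at
  the fields just described: format, solid, present):

      pixman_sse_jit_t *jit = pixman_sse_jit_new ();

      pixman_sse_image_info src_info  = { PIXMAN_a8r8g8b8, FALSE, TRUE };
      pixman_sse_image_info mask_info = { PIXMAN_a8,       FALSE, TRUE };
      pixman_sse_image_info dest_info = { PIXMAN_a8r8g8b8, FALSE, TRUE };

      pixman_sse_code_t *code =
          pixman_sse_jit_get_op (jit, PIXMAN_OP_OVER,
                                 &src_info, &mask_info, &dest_info);

      pixman_sse_code_run (code, PIXMAN_OP_OVER, src, mask, dest,
                           0, 0, 0, 0, 0, 0, width, height);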

DONE:

- Pixman CPU detection should be generated dynamically. That will get
  rid of the annoying #ifdefs and getisax() stuff.

  - Backwards vs. forwards

    The code currently in testjit iterates backwards over each line. It
    may be a little better to go forward. This could be done by

      - initializing the line to (line + w * bpp)
      - initializing w to -width
      - not having a displacement:

      	    movq (line, w, bpp), xmm0

	and 

	    add 2, width

    Current code iterates forwards.

  - The memindex/membase should take ops, not reg numbers.

  - There should be only REG and MEM in the ops. emit_memindex() can
    handle everything. 

  - Assembler needs to deal with labels. API

    	      x86_jz (a, "no_cpuid");
    	      asm_label (a, "no_cpuid");

    The assembler would just store a big list of labels/positions, then
    patch up at asm_emit() time.

  - REG ops can contain any register whatsoever (xmm, mmx, eax, al
    etc.). There should be a separate op_get_regno() function to extract
    the register number from an op.

  - There should be eax() type functions.

  - Renaming mov to movl was a mistake. The assembler should be able to
    figure out the size of the operation by itself. (By looking at
    the involved registers).