forked from gigablast/open-source-search-engine
//-*- coding: utf-8 -*-
// stjohnscollege.edu
// - lost event because we changed the implied sections algo and no
// longer adds the address and store hours as a single implied section
// - probably should write this one off
// ingramhillmusic.com
// stop it from telescoping to the month/year pairs in the blog roll.
// detect brothers that are month/year pairs in a list and do not telescope
// to them. set their dates as DF_ARCHIVE_DATES
//. to fix folkmads.org we should allow the 3 tods to propagate up to the
// date above them. then to avoid mult locations for event we should telescope
// all pieces of the date telescope kinda like at the same time until we
// encounter an address. OR allow an hr tag to propagate up until it hits
// text, then set its section there.
// Dates.cpp revision idea:
// - i might go so far to say that any time you have different dates in
// a section that you are compatible with, then things are ambiguous
// and you should give up entirely with the telescope.
// - we use this algo for assigning addresses i think to event dates
// - we should keep the telescope up until it hits a point of ambiguity
// - but if we can contain 2+ dates from the same section in the same
// telescope then it is not ambiguous and that is ok....
// - how would this affect our other pages?
// - would fix http://www.ingramhillmusic.com/tour/ ?
// - would fix stoart.com ?
// - would fix christchurchcincinnati.org?
// --------------------
// list of events with bad times (4) (fix these first)
// --------------------
// http://christchurchcincinnati.org/worship
// - bad implied section, should be based on h2 tag, but it is based on
// a single <p> tag with heading bit set (METHOD_ATTRIBUTE) i think
// - gets some wrong event dates
// - 12:10 should not telescope to "the Sundays" because it has
// "wednesdays" in its title. do we have bad implied sections?
// - misses "ten o'clock" date format
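// a minimal sketch of recognizing the "ten o'clock" format noted above.
// parseWordOClock() is a hypothetical helper, not a function from
// Dates.cpp. it only maps the spelled-out hour to 1-12; deciding am vs. pm
// would still need context from the surrounding dates.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Returns the hour (1-12) for phrases like "ten o'clock", or -1 if the
// phrase is not recognized. Hypothetical helper for illustration only.
int parseWordOClock(const std::string &text) {
    static const char *kHours[] = {
        "one", "two", "three", "four", "five", "six",
        "seven", "eight", "nine", "ten", "eleven", "twelve"};
    // lowercase copy so "Ten O'Clock" matches too
    std::string s;
    for (unsigned char c : text) s += static_cast<char>(std::tolower(c));
    const std::string suffix = " o'clock";
    if (s.size() <= suffix.size()) return -1;
    if (s.compare(s.size() - suffix.size(), suffix.size(), suffix) != 0)
        return -1;
    // whatever precedes the suffix must be exactly one hour word
    const std::string word = s.substr(0, s.size() - suffix.size());
    for (int i = 0; i < 12; i++)
        if (word == kHours[i]) return i + 1;
    return -1;
}
```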
// http://milfordtheatreguilde.org/Larceny.htm
// - gets some wrong event dates
// - seems to be ignoring the date list: oct 8th, 9th, ...
// - easy fix
// http://www.contemporaryartscenter.org/UnMuseum/ThursdayArtPlay
// - gets some wrong event dates
// - allows a store hours to telescope to all possible combos but in this
// case it should always telescope to the "summer" in its sentence...
// and be required to have that.
// - just allow the plain store hours to be a subdate if compared to a
// store hours that has a seasonal or month range...
// - easy fix
// http://www.stoart.com/
// - what happened to datelistcontainer around the dow list?
// - eliminate addresses that are picture subtitles or are in
// picture galleries. the address is describing the picture not the
// event.
// - the author's schedule is not aligned with the dows. he relies on the
// browser rendering the two table columns just right...
// - should not be allowing those list of tod ranges to telescope to
// any dow since the dows are in their own list. i thought i had logic
// added recently to prevent this...
// - then if such a thing happens, that list of headers should block the
// telescoping and we end up with just a bunch of tod ranges, and we
// should ignore any event that is just a tod range.
// - likewise, July 2010 should not telescope to Saturday then 1:30-3:30pm.
// it can telescope to Saturday because we allow telescoping to a list
// of headers for the new MULTIPLE HEADER algo, but then the Saturday
// can't telescope to a non-contained or brother list of tods
// - do not consider vertical lists the same date types, and do not allow
// any other dates to telescope to them or past such vertical lists, also
// the vertical list must be side-by-side with another vertical list for
// this algo to really work. so quite a few constraints for something
// that is ambiguous anyhow, even if in a side-by-side list format.
// - and address for events is wrong.
// If the wrong address was in a sentence, like I created this work of
// art at 1528 Madison Rd. Cinti OH then we could at least look at the
// structure of the sentence to deduce that it was not talking about the
// events. But it has no sentence context.
// - maybe if the address is in a list of other "things", don't use it...
// - if the address is in a list of brothers, and the tod of the event is
// not a brother in that list, i would say, ignore the address. the idea
// being that the list is independent of the tod. i think this could hurt
// some good pages though...
// - maybe we can fix by noting that the gallery address is unused and
// set EV_UNCLEAR_ADDRESS on the events we do find. or EV_NESTED_ADDRESSES,
// since one address is like the header of the other...
// * HOW TO FIX???
// * HARD FIXES - maybe just leave alone
// --------------------
// end list of events with bad times
// --------------------
// --------------------
// list of events with bad locations (1) (fix these next)
// --------------------
// http://www.so-nkysdf.com/Wednesday.htm
// - i think our METHOD_DOW_PURE fixed these implied sections
// - but why aren't we getting the "Hex" title?
// - ah, our implied sections are the best, they are shifted down by 2!
// - why didn't an A=25 tagid work out? yeah if we did avgsim it would work..
// http://all-angels.com/programs/justice/
// - each event has a school name in the tod sentence, but we are not
// recognizing that as a place!!
// - need to identify default city/state of a website for getting the schools
// * BAD EVENTS
// * NEED TO IDENTIFY THE DEFAULT CITY/STATE of A WEBSITE
// * SUPPORT "at the following/these locations/places:"
// --------------------
// end list of events with bad locations
// --------------------
// st-margarets.org/
// - missing the thanksgiving eve as the title
// - telescoping to a fuzzy year range 2009-2010, should make that fuzzy
// http://www.ingramhillmusic.com/tour/
// - identify lists of disjoint dates. do not allow those lists to participate
// in the telescoping process. then unless the date you are telescoping
// from is in that list, you must ignore the dates in that list as far
// as telescoping to them as headers. and the dates in that list can't
// be the base of a telescope either.
// - this might be another way to fix thewoodencow.com
// - what about stoart.com, it would prevent the one list of tods from
// combining with the other list of dows. so we would lose most of our
// events for stoart.com
// - this basically would short-circuit our combinatorics approach???
// i.e. "comboTable" in Dates.cpp?
// - i might go so far to say that any time you have different dates in
// a section that you are compatible with, then things are ambiguous
// and you should give up entirely with the telescope.
// - how would this affect our other pages?
// - or just keep it simple and label the dates as DF_ARCHIVE_DATE since
// their month/year list format is very popular. then just ignore such
// dates for telescoping.
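// a minimal sketch of the "keep it simple" idea above: flag a brother list
// as archive dates when every entry is a bare month/year pair. the helper
// names and the minItems threshold are assumptions for illustration; the
// real code would set DF_ARCHIVE_DATE on the Date objects instead of
// working on strings.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// True if s looks like "<month name> <4-digit year>", e.g. "July 2010".
bool isMonthYearPair(const std::string &s) {
    static const char *kMonths[] = {
        "january", "february", "march", "april", "may", "june", "july",
        "august", "september", "october", "november", "december"};
    std::istringstream in(s);
    std::string month, year, extra;
    // need exactly two whitespace-separated tokens
    if (!(in >> month >> year) || (in >> extra)) return false;
    for (char &c : month)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    bool known = false;
    for (const char *m : kMonths)
        if (month == m) known = true;
    if (!known || year.size() != 4) return false;
    for (unsigned char c : year)
        if (!std::isdigit(c)) return false;
    return true;
}

// A brother list is treated as an archive list when it has at least
// minItems entries and every entry is a month/year pair.
bool isArchiveDateList(const std::vector<std::string> &brothers,
                       std::size_t minItems = 2) {
    if (brothers.size() < minItems) return false;
    for (const std::string &b : brothers)
        if (!isMonthYearPair(b)) return false;
    return true;
}
```

// dates in a list that passes isArchiveDateList() would then be skipped
// both as telescope headers and as telescope bases.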
// http://www.guysndollsllc.com/page5/page4/page4.html
// - more or less ok. most events are outlinked titles.
// http://www.lilcharlies.com/brewCalendar.asp
// - Sunday should not map to 4pm-6pm but it does because we think 4pm-6pm
// is store hours, but how can we think that? it needs to combine with
// a dow in order to be store hours.
// - how did we get "Sunday [[]] 4pm - 6pm" ???
// - brbrtagdelim (double br) should be enough to keep the right dow mapping
// to the right tod.
// - bad titles because we think the strong tag portion is part of a longer
// sentence. so do not make sentence go across the strong or bold tag
// or italic or underline tag UNLESS the next word is lower case, etc.
// so treat these non-breaking tags as we treat the other breaking tags.
// - BETTER SENTENCE DETECTION (EASY)
// http://sfmusictech.com/
// - hotel kabuki
// - we now get the cocktail event again since i added custom-delimiter
// implied sections
// http://www.guysndollsllc.com/
// - has bad telescope: "until 2:00 a.m [[]] Tuesday through Sunday (Monday)"
// which does not have a real start time. should telescope to
// "4:00 p.m. until 2:00 a.m." since it should be kitchen hours.
// * INCOMPLETE EVENT TIME
// * FIX KITCHEN HOURS
// * FIX ONGOING EVENT DATE TELESCOPES
// http://www.thepokeratlas.com/poker-room/isleta-casino/247/
// - these all seem to be in november 2009 and spidered in may 2010 so the
// dates are old
// - implied sections need help here really
// - 2009 is not being detected as a copyright date which it should be
// cuz in a <div id=copyright>2009 The Poker Atlas</div> tag at bottom of
// the page.
// - BETTER COPYRIGHT DETECTION. telescope around the year's sentence until
// we hit other text. search for "copyright" in all tags telescoped to.
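// a minimal sketch of the copyright telescope described above: walk the
// enclosing tags outward from the year's sentence, stop at the first tag
// that contains other text, and look for "copyright" along the way. the
// EnclosingTag struct and isCopyrightYear() are hypothetical; the real code
// would walk Section parents rather than a prebuilt vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

struct EnclosingTag {
    std::string rawTag;  // e.g. "<div id=copyright>"
    bool hasOtherText;   // does it contain text outside the year's sentence?
};

static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// tagsInnermostFirst: the tags around the sentence, closest one first.
bool isCopyrightYear(const std::string &sentence,
                     const std::vector<EnclosingTag> &tagsInnermostFirst) {
    if (toLower(sentence).find("copyright") != std::string::npos) return true;
    for (const EnclosingTag &t : tagsInnermostFirst) {
        // once the tag holds other text we have telescoped too far
        if (t.hasOtherText) break;
        if (toLower(t.rawTag).find("copyright") != std::string::npos)
            return true;
    }
    return false;
}
```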
// http://www.southgatehouse.com/
// - misses title "Yo La Tengo" because it thinks it is in a menu and
// gets "Non Smoking Show" as at least the same title score...
// - how to fix?
// * BAD EVENT TITLES
// http://www.cabq.gov/library/branches.html
// - we fixed the titles with our new implied sections
// - title #12 is in same implied section as #11. why? because missing <hr>
// - #1 has a bad event title. why is it getting that google map as title?
// http://www.burlingtonantiqueshow.com/7128.html
// - if city state follows ()'s which follow street, treat it as inlined still
// that way we can get the right address here
// - use the alt=directions link as the site venue. should update the venue
// algo to look at that. also consider "location" or "how to get here/there"
// * DISREGARD ()'s FOR INLINED ADDRESSES
// * UPDATE VENUE ALGO
// * EASY FIX
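// a minimal sketch of disregarding ()'s before the inlined-address check:
// drop every parenthetical group so the city/state ends up right after the
// street. stripParentheticals() is a hypothetical helper and the sample
// address in the note below is made up.

```cpp
#include <cassert>
#include <string>

// Removes "(...)" groups (nested ones too) and collapses the doubled
// space a removed group leaves behind. Unbalanced ')' are dropped.
std::string stripParentheticals(const std::string &in) {
    std::string out;
    int depth = 0;
    for (char c : in) {
        if (c == '(') { depth++; continue; }
        if (c == ')') { if (depth > 0) depth--; continue; }
        if (depth > 0) continue;  // inside a parenthetical
        if (c == ' ' && !out.empty() && out.back() == ' ') continue;
        out += c;
    }
    return out;
}
```

// e.g. "123 Main St (rear lot) Springfield, IL" becomes
// "123 Main St Springfield, IL", which the inlined check can then accept.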
// http://www.burlingtonantiqueshow.com/
// - no location given, but if we update the venue algo as stated above we
// can default the location to the venue.
// * NEED TWO FIXES ABOVE
// * EASY FIX
// http://www.junkmarketstyle.com/item/195/burlington-antique-show
// - seems to be ok now
// http://www.queencityshows.com/tristate/tristate.html
// - July 3 & 4 is resulting in empty times but shouldn't be!
// * FIX INTERVAL COMPUTATION
// * EASY FIX
// http://www.thewomensconnection.org/Programs/Monthly_Meetups_For_Women.htm
// - need to alias non-inlined street address to its inlined equivalent
// * FIX ADDRESS ALIAS ALGO
// * EASY FIX
// http://preciousharvest.com/feed
// - rss content is not expanded... why? need to expand CDATA tags...
// * EASY FIX
// http://www.andersonparks.com/ProgramDescriptions/YoungRembrandtsSummerCamps.html
// - thinks event date is registration date since it is after a
// "register now" link.
// - do not treat date as registration hours if it is 2 or less hours like
// 1 - 2:30pm, because what box office is only open for a few hours?
// * EASY FIX
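// a minimal sketch of the duration rule above. couldBeRegistrationHours()
// is a hypothetical helper; times are minutes since midnight and the
// 120-minute cutoff comes from the "2 or less hours" note.

```cpp
#include <cassert>

// Returns false for short ranges (two hours or less), which are more
// likely to be the event itself than box office hours.
bool couldBeRegistrationHours(int startMin, int endMin) {
    int duration = endMin - startMin;
    if (duration < 0) duration += 24 * 60;  // range crosses midnight
    return duration > 120;
}
```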
// abqcsl.org
// - the youth services tod range was telescoping to "Sunday" when we had
// an exception in isCompatible() to fix folkmads.org, which allowed an
// isolated tod section to telescope its tod to a section that already had
// a tod. but really are the youth services on sunday? that does not
// seem clear really...
// - 3/14/10 should telescope to the store hours, but because a brother
// section has a tod "Oct 18, 1:15PM" it doesn't.
// - 3/14/10 is in a datelistcontainer so it can't be a header
// - it should not be included anyway because its title is outlinked
// - taking out the line in isCompatible() meant for peachpundit.com actually
// seems to bring back the 3/14/10 telescoping to sunday hours event
// http://www.arniesonthelevee.com/
// - needs support for "all week" to get the store hours i think
// http://schools.publicschoolsreport.com/county/NM/Sandoval.html
// - misses santo domingo school because we do not recognize the city
// "sn domingo pblo" which would inline the "I-25 & Hwy 301" intersection.
// - but the elementary school uses a "1" instead of an "I" for "I-25"!
// http://yellowpages.superpages.com/listings.jsp?CS=L&MCBP=true&search=Find+It&SRC=&C=bicycles&STYPE=S&L=Albuquerque+NM+&x=0&y=0
// - "2430 Washington St NE" misses latitude because it is not preceded by
// a zero nor does it have a decimal point in it
// http://www.menuism.com/cities/us/nm/albuquerque/n/7414-south-san-pedro
// - has abq,nm BEFORE the street address
// - we only got it by luck before because the state was in the name2
// and we were calling addProperPlaces on name1 and name2 ... and the
// city abq was in the page title
// * WHAT TO DO? -- scan headers for abq nm??????
// http://www.collectiveautonomy.net/mediawiki/index.php?title=Albuquerque
// . misses event because it can not associate UNM with Abq, NM
// * NEED BETTER PLACE MAPPING
//. http://www.wholefoodsmarket.com/stores/albuquerque/
// - good titles
// - "STORES" at end should be a menu header but is not
//. http://www.switchboard.com/albuquerque-nm/doughnuts/
// - good titles
// - lost phone # in description when we ignored span/font tags. because
// it is in a div hide tag.
// - thinks switchboard.com biz category line is a menu header now that
// implied sections groups it with that...
//http://www.zvents.com/albuquerque-nm/events/show/88543421-the-love-song-of-j-robert-oppenheimer-by-carson-kreitzer
// - good titles
// - gets "Feed Readers (RSS/XML" as possible title
// - includes quite a bit of menu cruft, hopefully will fade out
// with SEC_MENU... check for 2nd zvents.com url... (it does! see below)
// - we should get the actual title but we get "Other future dates...".
// i guess we should give a bonus if matches the title tag?
// * BONUS IF MATCHES TITLE TAG
// http://www.when.com/albuquerque-nm/venues
// - getting the place name of the event and not the event name because
// the unverified place name has the same title score because it is
// not verified, and because it is to the left of the time, it is
// preferred then.
// * NEEDS MORE PAGES SPIDERED (to verify the place names)
// http://www.zvents.com/albuquerque-nm/events/show/88688960-sea-the-invalid-mariner
// - gets "Feed Readers (RSS/XML" as possible title
// - "Other Future Dates & Times" title...
// * BONUS IF MATCHES TITLE TAG
// . http://texasdrums.drums.org/albuquerque.htm
// - alternating rows in table are all headers... we ignore these for now.
// but do we need header identification or something to do right?
// - STRANGE TABLE HEADERS
//. http://www.usadancenm.org/links.html
// - seems ok, but the best titles are mostly lowercase around the times
// and we are getting address-y titles for the most part now
// * NO CASE PENALTY IF SENTENCE INCLUDES EVENT DATE
//. facebook.com
// - gets "Full" and "Compact" as part of event description, but those are
// options for the "View: ". so we need a special menu detector that
// realizes one item in the list will not be a link because it is a
// selection menu. then "View:" should be flagged as a menu header.
// - any link with a language name like "English (US)" should be
// marked as SEC_MENU if in its own section and is a link.
// * NEED SELECTION MENU DETECTOR
// * IDENTIFY LANGUAGE LINKS AS SEC_MENU
// thingstodo.msn.com
// - best title is "Bird Walk" in a link, but we miss it. we get
// "Upcoming Events" instead because it gets an inheadertag boost. but
// if we spider enough pages i would think it would get a penalty from
// being repeated on other different event pages.
// * NEEDS MORE PAGES SPIDERED
//. http://www.collectorsguide.com/ab/abmud.html
// - misses jonson gallery address because of no "new mexico" in title
// - misses atomic museum address for same reason
// - misses "Friday of every month at 1:30pm -- call for reservations"
// because of SEC_HAS_REGISTRATION bit. how to fix?
// - good titles
// - "last modified: September 24, 2007" should be marked as a last mod
// date by Dates.cpp and excluded completely in the min/max event id algo
// * IDENTIFY AND IGNORE LAST MODIFIED/UPDATED DATES AND SECTIONS
// * ADD META DESCRIPTION like we do titles for places to fix jonson gallery.
//. http://www.abqfolkfest.org/resources.shtml
// - american sewing guild is just in strong tags so is not its own
// sentence, so the title algo breaks down there. but they might have
// just as easily forwent the strong tags, then, how would we get the title?
// i would say this is mostly title-less
// - "For questions or comments contact the webmaster" ???? dunno... SEC_DUP?
// - getting a Last Updated date in the event descriptions too
// - lost a title because of TSF_MIXED_TEXT
// "Tango Club of Albuquerque (Argentine Tango)". should we split up the
// sentence when it ends in a parenthetical to fix that? the new title
// is now "DANCE" which is the generic header.
// * IDENTIFY AND IGNORE LAST MODIFIED/UPDATED DATES AND SECTIONS
//. http://www.unm.edu/~willow/homeless/services.html
// - a bad implied section giving us menu crap for the first few events
// - we get header cruft for every event, so we need implied sections to
// bind the headers to the sections they head. the header are:
// Family Health, Child Care, School Perparation, Food, Fathers,
// Activities. i think they were bound with the font tags which we got
// rid of.
// - for "Tue. - Fri. 9 am. - 11 am" title we are missing the event
// address in the description... what's up with that?
// 101 broadway does not have address as a title candidate... wtf? was
// that on purpose?? no, the other events have address as title candidates
// misses "Noon Day Ministry" as title...
// - missed "Closed the 1st and 15th of each month;"
// - recognize "(no Thurs)" as except thursday.
// - treat "Fri. pm." as "Friday night"
// - missing "801 mountain" event... why?
// * BETTER IMPLIED SECTIONS
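// a minimal sketch of treating "(no Thurs)" as "except thursday" on a
// day-of-week bitmask. the Dow enum and both helpers are hypothetical and
// do not mirror the real dow representation in Dates.cpp.

```cpp
#include <cassert>
#include <cstdint>

enum Dow : uint8_t { SUN = 0, MON, TUE, WED, THU, FRI, SAT };

inline uint8_t dowBit(Dow d) { return static_cast<uint8_t>(1u << d); }

// Inclusive range that may wrap around the week, e.g. Fri-Mon.
uint8_t dowRange(Dow from, Dow to) {
    uint8_t mask = 0;
    for (int d = from;; d = (d + 1) % 7) {
        mask |= dowBit(static_cast<Dow>(d));
        if (d == to) break;
    }
    return mask;
}

// Clears one day from the mask, as "(no Thurs)" would.
uint8_t exceptDow(uint8_t mask, Dow excluded) {
    return static_cast<uint8_t>(mask & ~dowBit(excluded));
}
```

// so a "Tue. - Fri." range qualified by "(no Thurs)" would leave only
// tuesday, wednesday and friday set.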
// http://events.mapchannels.com/Index.aspx?venue=628
// - pretty good. has a little menu cruft, but not too bad.
// http://www.salsapower.com/cities/us/newmexico.htm
// - IGNORE WEBMASTER BLURBS (contact webmaster, webmaster/design...)
// - combine copyright, webmaster, advertising blurbs at the end into
// a tail section and ignore...
// - "interested in advertising with us..." part of tail and probably
// would have high SV_DUP score relative to the rest of the scores.
// - getting "Instructores" in description of Cooperage event because
// it is an isolated header with no elements beneath it, other than
// the other header "Santa Fe", which is a header of an implied section.
// i mentioned this below and called it the double header bug.
// * DOUBLE HEADER BUG
// http://www.americantowns.com/nm/albuquerque/events/abq-social-variety-dances-2009-08-22
// - lost event because i guess we added a delimiter-based implied section to
// split the two tod ranges into two different "hard" sections.
// - perhaps not EVERY dance is held at abq sw dance center, so maybe it is
// a good/safe thing that we do not get that event any more.
// - old comments:
// - title is good
// - event description has some menu cruft in it:
// - getting view by date, view by timeframe, view by category list menu
// headers in event description
// - has some real estate agent headers which is not seen as a menu
// header because it only has one link in its menu
// - has navigation links "Add Your <a>business</a> or <a>group</a>" which
// are not 100% in a link, but they are in a list where each item in that
// list does have a link in it, maybe make that exception to the SEC_MENU
// algo, that if the section does contain link text it is acceptable,
// even if it also contains plain text.
// - lone link "See All Cities in New Mexico". how to fix?
// * SUPPORT FOR SINGLE LINK HEADER IDENTIFICATION
// http://www.ceder.net/clubdb/view.php4?action=query&StateId=31
// - titles and descriptions seem pretty good.
// http://www.newmexico.org/calendar/events/index.php?com=detail&eID=9694&year=2009&month=11
// - titles and descriptions seem pretty good.
// http://www.meetup.com/Ballroom-Dance-in-Albuquerque/
// - has a list of languages (language menu)
// - has a trademark blurb "trademarks belong to their respective owners"
// - has a "Read more" link that goes to another page at end of event desc.
// * LANGUAGE MENU
// http://www.abqtango.org/current.html
// - has one bad title because case is bad:
// "Free introductory Argentine Tango dance class" and ends up getting
// less good titles.
// - misses another good title because it has "business district" in
// lower case when it shouldn't really.
// - so we are missing some good titles because of our case penalty...
// perhaps we should not do that if the sentence includes the event date???
// * NO CASE PENALTY IF SENTENCE INCLUDES EVENT DATE
// http://www.sfreporter.com/contact_us/#
// - good title "business hours" now
// - has some menu cruft
// - has a "search" section with a bunch of forms and we get the form
// headers in the event description
// * FORM TAG HEADER DETECTION
// http://pacificmedicalcenters.org/index.php/where-we-are/first-hill/
// - good titles
// - get some doctor's names that were not labeled as SEC_MENU because
// they were by themselves in the list. how to fix?
// * SUPPORT FOR SINGLE LINK HEADER IDENTIFICATION
// http://www.santafeplayhouse.org/onstage.php4
// - bad implied sections for Ticket Price header etc. but we still get the
// correct dates though
// . later we should probably consider doing a larger partition first
// then partitioning those larger sections further. like looking
// ahead a move in a chess game. should better partition
// santafeplayhouse.org methinks this way.
// - give bonus points if implied section ends on a double <br> tag?
// - bad titles...
// - penalizing "Performance Dates:" because it has a colon, even
// though it is a header for a list of brothers. maybe do not penalize
// under such conditions. this would fix the "pay-what-you-wish" title too!
// - getting bad title "Pay-what-you-wish" which is actually a "price" in
// the ticket prices table. maybe we should penalize event titles in
// registration sections? or treat it as "free" (h_free in Events.cpp)
// so we think of it as another price point. or count it for "dollarCount"
// in Events.cpp.
// * NO HAS_COLON PENALTY if is header of a list of things
// realtor.com
// . both urls have the lat/lon twice, but the first pair misses the negative
// sign in front of the lon and therefore it throws our whole lat/lon algo
// out of sync and we miss the next lat/lon pair which is the real deal
// new event urls to do:
// http://www.weavespindye.org/?loc=8-00-00
// - no tod so no events
// - has no addresses
// - has one iframe, we support it
// http://www.thewoodencow.com/
// - we get store hours as events, but has unrelated events in description
// because it is talking about things going on, but with no dates, and
// only a "read more" link for each thing.
// * REMOVE UNRELATED BLURBS FROM EVENT DESCRIPTIONS ("read more links")
// * REMOVE SINGLE LINKS ("Subscribe (RSS)" link) from desc.
// * REMOVE WEBMASTER BLURB ("Office Space theme by Press75.com") from desc.
// http://www.thewoodencow.com/2010/07/19/a-walk-on-the-wild-side/
// - similar to root url
// * REMOVE SINGLE LINKS ("Subscribe (RSS)" link) from desc.
// * REMOVE WEBMASTER BLURB ("Office Space theme by Press75.com") from desc.
// http://www.adobetheater.org/
// - seems to be ok. got two event dates.
// http://villr.com/market.htm
// . made an exception in isCompatible() so the isolated month/day dates
// can telescope to the store hours dates section even though that section
// has month/day dates already.
// . if later have to undo this fix, then put a fix in that since the section
// has "every saturday" we should ignore its month/day and allow the
// isolated monthdays below to telescope to it. obviously "every saturday"
// is not referring to just one monthday...
// . NEED SUPPORT FOR "mid November"
// . NEEDS SUB-EVENT SUPPORT
// http://blackouttheatre.com/Blackout_Theatre/Upcoming_Productions.html
// . has "the box performance space" but could not find a default venue
// address on the website, and could not link this space to Abq, NM
// . NEED TO IDENTIFY THE DEFAULT CITY/STATE of A WEBSITE (by inlinkers?)
// http://vortexabq.org/
// - pretty hardcore
// - calls javascript to open the real content though and we need to support
// that: http://vortexabq.org/ProdnProcessing.php
// - has "reqa.open("GET","ProdnProcessing.php");" and we need that file
// - misses address: 2004½ Central Ave. SE, Albuquerque, NM 87106
// but might be a copyright address
// * DOWNLOAD JAVASCRIPT IN FUNCTIONS
// * SUPPORT ½ in addresses
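// a minimal sketch of accepting a half in a street number, either as the
// UTF-8 "½" character or spelled " 1/2". parseStreetNumber() and its out
// parameters are hypothetical; the real address parser works on word
// tokens rather than raw bytes.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Parses a leading street number such as "2004", "2004½" or "2004 1/2".
// On success stores the whole part, whether a half was present, and the
// offset of the first byte after the number. Returns false if the string
// does not start with a digit.
bool parseStreetNumber(const std::string &s, long &whole, bool &half,
                       std::size_t &next) {
    std::size_t i = 0;
    whole = 0;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) {
        whole = whole * 10 + (s[i] - '0');
        i++;
    }
    if (i == 0) return false;
    half = false;
    if (s.compare(i, 2, "\xC2\xBD") == 0) {  // UTF-8 bytes of U+00BD
        half = true;
        i += 2;
    } else if (s.compare(i, 4, " 1/2") == 0) {
        half = true;
        i += 4;
    }
    next = i;
    return true;
}
```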
// http://folkmads.org/special_events.html
// - misses little sub tod ranges because of the rule:
// "if ( (acc1 & acc2) == acc2 ) return false" because the header date
// itself already has a tod range so it doesn't care about our tod range.
// how to fix?
// - i added an exception at the end of isCompatible() to allow the isolated
// tods to telescope to the July date, but it was causing the pubdate tod for
// piratecatradio.com to telescope to the play time and address, so until we
// somehow are sure the tod is not a pubdate tod we have to leave this out
// - misses location "abq square dance center" has no city/state to pair with
// - we miss o neil's pub why? we can assume new mexico since that is in
// the title. then we need to be able to look up a place name with no
// city and just a state...
// * IF "ABQ" is in PLACE NAME, ASSUME CITY IS ABQ for placedb lookup
// * NEED TO IDENTIFY THE DEFAULT CITY/STATE of A WEBSITE (by inlinkers?)
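// a minimal sketch of the "(acc1 & acc2) == acc2" rule quoted above, under
// one reading of it: acc1 holds the date types the header has already
// accumulated, acc2 holds the types of the date that wants to telescope,
// and the telescope is refused when the header already covers all of them.
// the flag names and values are hypothetical, not the ones in Dates.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical date-type bits for illustration only.
enum : uint32_t {
    kTod    = 1u << 0,
    kDow    = 1u << 1,
    kDaynum = 1u << 2,
    kMonth  = 1u << 3,
    kYear   = 1u << 4
};

// acc1: types already present in the header date
// acc2: types carried by the date trying to telescope to it
bool canTelescope(uint32_t acc1, uint32_t acc2) {
    // the header already has every type we would contribute
    if ((acc1 & acc2) == acc2) return false;
    return true;
}
```

// this shows why the folkmads.org sub tod ranges are lost: a header with
// month, daynum and its own tod already covers a bare tod range, so the
// check returns false for it.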
// http://abqfolkdance.org/
// - misses a few tod range only sub-events because they are in an
// SEC_TOD_EVENT section i guess, or the telescopes fail because of the acc1
// algo... but even if in a separate hard section, we should allow the
// tod range to telescope to saturday nights if our section is only
// tods and tod ranges perhaps???
// "dancing begins at 8:15 and ends around 10:30."
// - the TOD ranges in the second section are sub times of the
// first section, so they should include the first section in their
// event description. we are using his address, right???
// * ADD "ENGLISH" TOD RANGES
// * SUPPORT FOR SUB EVENTS
// * SUPPORT SPECIAL RANGES: "begins around|at 8:15 and ends around|at 10:30"
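// a minimal sketch of the "begins around|at X and ends around|at Y" range
// using std::regex. TodRange and parseBeginsEnds() are hypothetical names.
// the result is minutes on a 12-hour clock face; am/pm would still have to
// be resolved from context.

```cpp
#include <cassert>
#include <regex>
#include <string>

struct TodRange {
    int startMin;  // minutes past the hour-0 mark, am/pm unresolved
    int endMin;
};

// Finds the first "begins around|at H:MM and ends around|at H:MM" phrase.
bool parseBeginsEnds(const std::string &text, TodRange &out) {
    static const std::regex re(
        R"re(begins\s+(?:around|at)\s+(\d{1,2}):(\d{2})\s+and\s+ends\s+(?:around|at)\s+(\d{1,2}):(\d{2}))re",
        std::regex::icase);
    std::smatch m;
    if (!std::regex_search(text, m, re)) return false;
    out.startMin = std::stoi(m[1].str()) * 60 + std::stoi(m[2].str());
    out.endMin = std::stoi(m[3].str()) * 60 + std::stoi(m[4].str());
    return true;
}
```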
// http://newmexicojazzfestival.org/
// . is getting the box office hours as events. add to registration keywords.
// * ADD MORE REGISTRATION KEYWORDS
// * SPIDERED DATE is IN JAN 2010
// www.newmexicomusic.org/directory/index.php?content=services&select=529
// . lost event because it is in the same sentence as "box office" because
// the author forgot to put a period in there to separate them into two
// different sentences!
// . "Call the box office for program information: 888.818.7872 or go online
// at www.spencertheater.com Free public tours are offered at 10 a.m. on
// Tuesdays and Thursdays throughout the year."
// * BETTER SENTENCE DETECTION
// http://sybarite5.org/upcoming.htm
// - got "January, October, December 2010" as a header because its datebrother
// bit was not set because it was at the top of the brother list. false
// date header caused us to lose some events.
// - support NYC for address like "338 West 23rd St. NYC"
// - grabbing part of an event description from something that seems like
// it should be paired up with an implied section with the date above it:
// "Piotr Szewczyk The Rebel..." should be paired up with
// "January 22,23 & 24 2010- 8:00pm" or AT LEAST in its own SEC_TOD_EVENT
// section to prevent it from being used as a description for the
// event with the date "July 24th 2010 7:30pm"
// - event description has another brother event desc in it... why? isn't
// the EV_TOD_EVENT working for this???
// - NYC should be recognized as NY,NY
// * BAD EVENT DESCRIPTION
// * NEEDS MORE IMPLIED SECTIONS
// http://corralesbosquegallery.com/
// - seems to be ok. gets the store hours.
// http://web.mac.com/bdensford/Gallery_website/Events_Calendar.html
// - the above website's events...
// - seems pretty good
// http://villr.com/market.htm
// - event description sentence mess up? "Los Ranchos Growers' and [[]] ..."
// - misses some parts of the event description because of SEC_TOD_EVENT
// section flags. but really the brother sections that caused that were
// actually subevents of the main date, although they did include a
// month and daynum themselves and not a sub tod range as most sub-events
// probably do.
// * SUPPORT FOR SUB EVENTS (month/daynum based)
// http://eventful.com/lawrenceburg/venues/lawrenceburg-fairgrounds-/V0-001-000208596-1
// - has address of lawrenceburg fairgrounds but only as an intersection
// * BETTER INTERSECTION ADDRESSES
// http://rodeo.cincinnati.com/f2/events/proddisplay.aspx?d=&prodid=3461
// - address has no street number "MainStrasse Village, Main Street
// Covington, KY 41011"
// - placedb should index streets without their numbers but with zip codes
// as if they were place names, like "Tom's Grill, Abq NM". but only
// do that if we have a gps point to go with it.
// * INDEX STREET NAMES WITHOUT NUMBERS INTO PLACEDB
// http://www.scrap-ink.com/
// - all flash, can't parse it
// http://www.newmexico.org/calendar/events/index.php?com=detail&eID=9694&year=2009&month=11
// - title of "Cost:" is bad because it precedes a colon -70%
// - best title is "Beginnin Square Dance Lessons, Albuquerque"
// - "disclaimer & use" and "Contact New Mexico TOurism Dept" should be
// part of a menu! wtf? sentence flip flop?
// - we leave out the dollar sign '$' in one of the description sections for
// the cost of the event since the section starts with that!
// - "More details about this meetup" probably a high SV_DUP and since it
// starts with "more" and is in a link, will probably be excluded as a menu
// link
// - sentence flip flop, "Promote!" should be SEC_MENU!
// - "Asst." should be in Abbreviations.h list so that "Asst. Organizers:"
// will be just one section, and will have tiny title and desc. score since
// it precedes a colon.
// - "Trademarks belong to their..." will have high SV_DUP count and therefore
// minimal title and desc. score.
// - language names in a list should have minimal title and desc score.
// but probably no need to detect since SV_DUP will be high eventually.
// * for title score ties prefer one close to the event date with highest
// m_a
// - i would exclude really high SV_DUP dup scores from the title/desc and
// index to keep things clear. but we do want to have field names like
// "Category" that label other non dup-ish content. so labels are ok, but
// not stuff like "More details about this Meetup..." which has a high
// SV_DUP count and is not a field name for anything.
// http://www.sfreporter.com/contact_us/
// - single store hours "event"
// - probably ok but sentence flip flop bug letting in menus?
// http://www.publicbroadcasting.net/kunm/events.eventsmain
// - lost the guild cinema address, but i do not see nm or "new mexico"
// anywhere on the page, so even though albuquerque is right after
// "the guild cinema", if we have no state name, we can't make it work...
// - SUPPORT CITIES WITH NO STATE NAMES SOMEHOW
//mdw left off here do. pacific medical... but fix other bugs first...
// http://www.publicbroadcasting.net/kunm/events.eventsmain?action=showCategoryListing&newSearch=true&categorySearch=4025
// getting bad titles of "Date:"
// need TSF_DATE_SECTION to penalize title score! so when a
// section is a date only, do like x .05
// - need a .90 after colon penalty TSF_AFTER_COLON...
// reverbnation.com:
// this is a toughy!!! we got a lower case title. we have
// multiple bands which is ok, but we are getting categories
// like "Latin" and "Bogota, CO" as a title. maybe discount
// place names ...
// - for every repeated section tag hash, compute a global
// average title score, and apply that to boost titles that
// might be lower case like "kimo" is on this page. i.e. we
// are voting on the best title sections. and we should also
// use sectiondb for this as well as this local algo.
// - in the case of multiple events
// - if section has a prev or next brother with the same taghash
// then probably give a "list" TLF_IN_LIST penalty for that
// of like maybe .80, not too harsh...
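// A rough sketch of how the multiplicative title-score penalties proposed
// in the notes above might combine. Everything in namespace
// title_penalty_sketch is hypothetical: the SK_* flags only stand in for the
// proposed TSF_DATE_SECTION, TSF_AFTER_COLON and TLF_IN_LIST bits and are
// not the real definitions. Only the multipliers (.05, .90, .80) come from
// the notes, and multiplying them together is an assumption.
namespace title_penalty_sketch {

enum {
	SK_DATE_SECTION = 0x01, // section holds only a date, like "Date:"
	SK_AFTER_COLON  = 0x02, // section text comes after a colon
	SK_IN_LIST      = 0x04  // prev/next brother has the same tag hash
};

// Scale a base title score by every penalty whose flag is set.
inline float applyTitlePenalties ( float baseScore , int flags ) {
	float score = baseScore;
	if ( flags & SK_DATE_SECTION ) score *= 0.05f;
	if ( flags & SK_AFTER_COLON  ) score *= 0.90f;
	if ( flags & SK_IN_LIST      ) score *= 0.80f;
	return score;
}

} // namespace title_penalty_sketch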
// .. for sections that have no dup/nondup voting info, consider comparing
// their content to sections on other websites that do have adequate voting
// info, and if similar, maybe use that voting info. might help us nuke
// certain types of footers and headers... legal disclaimers, etc. brain
// kinda works like this.
// ** in title tag, allow " - " to split a sentence section
// ** prefer the title that matches a section in the title tag then.
/*
BUT what about burtstikilounge??? all events are lists of links. i guess
then we just need to rely on SEC_NOT_DUP????
well kinda, the whole calendar would have SEC_NOT_DUP, but an individual
cell of the table could have SEC_DUP and/or SEC_NOT_DUP!!
to fix burts: take the list of links that we think is SEC_CRUFT_COMMON
then look that up as a whole section and if SEC_NOT_DUP is set then do
not set SEC_CRUFT on it otherwise set it !!!
does that work?
apply to renegade links as well?
*/
//
// missed events:
//
// http://www.zvents.com/albuquerque-nm/events/show/88688960-sea-the-invalid-mariner
// two of the events now have non-outlinked titles. good. but
// the second date's title is wrong.
// SEA & the Invalid Mariner...
// * EV_OUTLINKED_TITLE casualty
// * BAD TITLE ("Date", ignore <th> tags, SEC_CRUFT_DETECT bit)
// collectorsguide.com:
// special subevent at jonson gallery starts at 5:30 but in
// the next sentence, which actually applies to unm art gallery,
// store hours are given up until 4pm, so this cancels out the
// 5:30pm and results in empty times. we could check to see if
// the header is compatible before we add it???
// - use title expansion algo. should be ok since address will
// be included and we should not set EV_OUTLINKED_TITLE.
// * BAD DATE HEADER ALGO
// * BAD TITLES (need full expansion algo)
// abqfolkfest.org
// need to do to-brother title section expansion algo.
// * BAD TITLES (need full expansion algo)
// http://www.guildcinema.com/
// one bad title.
// when scanning to set the title in Events.cpp we start at
// the first date in the telescope, however we should in this
// case start at the 6pm to get the right title. maybe pick
// the date with the highest word # to start at, unless it does
// not have the smallest headerCount (i.e. unless it is used
// in more telescopes as headers than another date)
// - set Date::m_headerCount in Dates.cpp at the end of the algo
// just loop through the dates and set that count for all
// Dates in a telescope not the first ptr.
// - so pick the date in the telescope with the highest m_a
// unless its m_headerCount is not at the min.
// - or would event deduping fix this?
// * BAD TITLES (start scan @ highest m_a,min m_headerCount)
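// A minimal sketch of the "start scan @ highest m_a, min m_headerCount"
// idea from the guildcinema.com notes above. SketchDate is a hypothetical
// stand-in holding only the two fields the notes name (m_a and
// m_headerCount); it is not the real Date class. It reads the rule as:
// among the dates whose header count equals the telescope's minimum, take
// the one with the highest word position. That reading is an assumption.
namespace title_scan_sketch {

struct SketchDate {
	int m_a;           // word position of the date in the document
	int m_headerCount; // # of telescopes that use this date as a header
};

// Returns the index of the date to start the title scan at, or -1 if the
// telescope is empty.
inline int pickTitleScanStart ( const SketchDate *dates , int n ) {
	if ( ! dates || n <= 0 ) return -1;
	// find the smallest header count in this telescope
	int minCount = dates[0].m_headerCount;
	for ( int i = 1 ; i < n ; i++ )
		if ( dates[i].m_headerCount < minCount )
			minCount = dates[i].m_headerCount;
	// among dates at that minimum, prefer the highest m_a
	int best = -1;
	for ( int i = 0 ; i < n ; i++ ) {
		if ( dates[i].m_headerCount != minCount ) continue;
		if ( best < 0 || dates[i].m_a > dates[best].m_a ) best = i;
	}
	return best;
}

} // namespace title_scan_sketch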
// http://events.mapchannels.com/Index.aspx?venue=628
// using "Buy Tickets from $xx" as titles. i guess we need to
// maybe look at the table column header for "Title"?
// * BAD TITLES (add "Buy Tickets*" links to renegade
// SEC_CRUFT list)
// http://www.salsapower.com/cities/us/newmexico.htm
// one title is "$5.00" without the $. maybe stop that.
// skip titles that are just a price.
// allow dates in titles if in same sentence as would be title.
// that should change "with Darrin..." title to
// "Tuesdays with Darrin".
// "Class at" will change to "Class at 7 p.m." but it really
// should be "...The salsa Dance Class at 7 p.m." but i guess
// the br tag is breaking the sentence?? we probably need to
// really improve our sentence detector to fix that right.
// Cooperage event is getting Instructores header as part of
// event description because of their double heading sections.
// FIX by not taking descriptions from brother sections that are
// isolated like that, when you contain its true brother in your
// implied section. it is like a bodyless header brother. do not
// get descriptions from those, maybe unless it is directly above
// you, since it could be a double header, which is rare, but
// that is what it is in this case.
// * BAD TITLES (full expansion algo?)
// * DOUBLE HEADING causing bad heading in event description
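// A small sketch of the "skip titles that are just a price" check from the
// salsapower.com notes above. isPriceOnlyTitle() is a hypothetical helper,
// not an existing function. It treats a candidate as price-only if it is an
// optional '$', some digits and an optional two-digit cents part, and
// nothing else. It requires either the '$' or the cents so that a bare
// number like the movie title "2012" is not rejected; that rule is an
// assumption.
#include <ctype.h>
namespace price_title_sketch {

inline bool isPriceOnlyTitle ( const char *s ) {
	if ( ! s ) return false;
	// skip leading whitespace
	while ( isspace ( (unsigned char)*s ) ) s++;
	bool hasDollar = false;
	if ( *s == '$' ) { hasDollar = true; s++; }
	// need at least one digit for the dollar amount
	int digits = 0;
	while ( isdigit ( (unsigned char)*s ) ) { s++; digits++; }
	if ( digits == 0 ) return false;
	// optional ".dd" cents part
	bool hasCents = false;
	if ( *s == '.' &&
	     isdigit ( (unsigned char)s[1] ) &&
	     isdigit ( (unsigned char)s[2] ) ) {
		hasCents = true;
		s += 3;
	}
	// only trailing whitespace may follow
	while ( isspace ( (unsigned char)*s ) ) s++;
	if ( *s ) return false;
	return hasDollar || hasCents;
}

} // namespace price_title_sketch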
// http://www.newmexico.org/calendar/events/index.php?com=detail&eID=9694&year=2009&month=11
// has just one event.
// title we get is "Cost" and is below the date. we really need
// to keep telescoping until we get text above at least one of
// the dates in the telescope... so if we discover we have a bad
// title then telescope until we got text on top of the lowest
// date. try to first get the title before the date. if we
// telescope up until we get text before the date, if all the
// new section we get before the date is just a title section
// looking thing (ignoring the SEC_CRUFT) then maybe that is
// the best title.
// * BAD TITLES (telescope until text above the date???)
// http://www.patpendergrass.com/albnews.html
// "saturday morning from 10:00 am - noon" is not telescoping
// to "March 19, 2005" like it should... wtf?
// * BAD DATE TELESCOPING
// http://www.abqtango.org/current.html
// one title is "New" so we should ignore that probably.
// * BAD TITLES (need full expansion algo probably for others)
// http://pacificmedicalcenters.org/index.php/where-we-are/first-hill/
// gets a couple titles wrong. full expansion would fix it.
// * BAD TITLES (need full expansion algo probably for others)
// http://www.santafeplayhouse.org/onstage.php4
// we do not realize that all these dates are talking about
// one event really... so titles are not the best...
// also do not parse an except/closed date correctly...
// * BAD TITLES (???)
// http://www.publicbroadcasting.net/kunm/events.eventsmain?action=showCategoryListing&newSearch=true&categorySearch=4025
// getting bad titles of "Date:" can be fixed with full exp algo.
// * BAD TITLES (need full expansion algo)
// http://www.dailylobo.com/calendar/
// bad title. only one event so can't maybe do full exp algo.
// Title is "Offered".
// * BAD TITLE (???)
// http://www.burtstikilounge.com/burts/
// there really are no titles.
// so we would just take the first item in a calendar day and
// ignoring dates would find the title to be outlinked which
// is probably a good thing.
// however the store hours do not really have brothers so
// maybe do not do full expansion on them???
// do not do the full expansion if we have a calendar page like
// this because there are often multiple events per daynum...
// * BAD TITLES (ignore daynums,...???)
// http://upcoming.yahoo.com/event/4888173/NM/Albuquerque/Pet-Loss-Group/The-Source/
// single event. bad title of "Event Photos" which is really
// SEC_CRUFT but we do not know it yet.
// * BAD TITLE (???)
// http://events.kqed.org/events/index.php?com=detail&eID=9812&year=2009&month=11
// has dup event. but really just one event. title is
// "Cost:" which is wrong, and the true title is above the date.
// consider telescoping until we get text above the date.
// * BAD TITLE (telescope until text above the date?)
// http://entertainment.signonsandiego.com/events/eve-selis/
// single event.
// has title "When". really we need to identify and ignore
// the menu cruft better.
// * BAD_TITLE (title is "When", telescope til text above)
// http://www.mrmovietimes.com/movie-theaters/Century-Rio-24.html
// a title is bad, it is now the address of the place.
// lost all events because their movie titles were outlinked.
// but the movie "2012" survived because its title was bypassed
// because D_IS_IN_DATE was set for it!
// - try to fix with another site page to set SEC_NOT_MENU
// * EV_OUTLINKED_TITLE casualty
// * BAD_TITLE ("2012" [the movie])
// http://www.trumba.com/calendars/KRQE_Calendar.rss
// - missed address "12611 Montgomery Blvd. NE, Suite A-4 in the
// Glenwood Shopping Center" because city is not after or before it,
// and i guess before when we did get this address, we had contact info
// or something in abq. now i don't see contact info or a venue addr for
// trumba, which is right...
// - missed "Each weekly program is offered on Sunday at 10:30am with a
// repeat on Wednesday at 6:00pm". was only getting them right before
// we added the comboTable logic in Dates.cpp to get all date combos,
// because of a fluke. really if Sunday and Wednesday were modified
// by "every" or were plural then they would not be allowed to telescope
// to the daynum/month date, which is causing them to be emptytimes.
// - the other trumba.com url i think has a similar issue for the
// "Transitioning Professionals..." events, which have meetings every
// Tuesday, but the "every" is not right before the Tuesday, so we miss
// that too. better safe than sorry!
// mdw left off here
// http://boe.sandovalcountynm.gov/location.html
// missing address:
// "960 FORREST RD 10 JEMEZ SPRINGS, NM 87025"
// http://www.uniquevenues.com/StJohnsNM
// missing address:
// "Colorado Office: 225 Main St, Opal Bldg, G-1 Edwards, CO"
// does not like the "suite" in between street and city.
// http://eventful.com/albuquerque/venues/the-filling-station-/V0-001-001121221-1
// before was protected by SEC_NOT_MENU logic, but now we had to
// remove that since SEC_NOT_MENU logic is not reliable.
// * EV_OUTLINKED_TITLE casualty
// http://events.kgoradio.com/san-francisco-ca/venues/show/4834-davies-symphony-hall
// really it is getting bad titles now and should not have
// any events since they are all outlinked titles...
// * EV_OUTLINKED_TITLE casualty
// * BAD TITLES ("Hide")
// http://www.zvents.com/albuquerque-nm/venues/show/11865-kimo-theatre
// this lost all its events except the store hours, which is
// expected behavior now.
// * EV_OUTLINKED_TITLE casualty
// http://www.when.com/albuquerque-nm/venues
// all its events had outlinked titles and it lost them all. good.
// * EV_OUTLINKED_TITLE casualty
// http://events.kgoradio.com/san-francisco-ca/events/show/88047269-san-francisco-symphony-chorus-sings-bachs-christmas-oratorio
// two of the events now have non-outlinked titles. good. but
// the second and third dates' titles are wrong.
// * EV_OUTLINKED_TITLE casualty
// http://events.sfgate.com/san-francisco-ca/venues/show/6136-exploratorium
// all its events had outlinked titles and it lost them all. good.
// * EV_OUTLINKED_TITLE casualty
// http://events.sfgate.com/san-francisco-ca/events/show/88884664-solstice-seed-swap
// all of its events but one were lost because of outlinked title.
// this is good.
// * EV_OUTLINKED_TITLE casualty
// http://www.when.com/albuquerque-nm/venues/show/1061223-guild-cinema
// all its events had outlinked titles and it lost them all. good.
// * EV_OUTLINKED_TITLE casualty
// http://www.reverbnation.com/venue/153991