Monday, March 19, 2012

Creating a range lookup table from a file of millions of rows

I have a file with the item id and the item type. The data looks as follows
:
ItemID ItemType
1 A
2 A
3 A
4 B
5 B
6 C
7 C
8 A
9 A
I want to create a lookup tables as follows:
Start End ItemType
1 3 A
4 5 B
6 7 C
8 9 A
Please keep in mind the file I have the ids on is millions of rows. Also,
there are gaps in the ids, (i.e. may jump from 4 to 6 no 5). Gaps are
acceptable as long as they are not too large.
Thanks in advance for any tips you can provide.I'm not sure what you mean by "gaps are acceptable as long as they are
not too large". Apparently no gaps in your sample data anyway. See if
this meets your requirements:
SELECT MIN(itemid), MAX(itemid), itemtype
FROM
(SELECT T1.itemid, T1.itemtype,
MIN(T2.itemid) AS x_itemid
FROM tbl AS T1
LEFT JOIN tbl AS T2
ON T1.itemtype <> T2.itemtype
AND T1.itemid < T2.itemid
GROUP BY T1.itemid, T1.itemtype) AS T
GROUP BY itemtype, x_itemid
If performance is an issue then you could do this for smaller subsets
of rows and then combine the results.
David Portas
SQL Server MVP
--|||David
Can I ask you, why did you join the table?
create table #test
(
itemid int not null primary key,
itemtype char(1) not null
)
insert into #test values (1,'a')
insert into #test values (2,'a')
insert into #test values (3,'a')
insert into #test values (4,'b')
insert into #test values (5,'b')
insert into #test values (6,'c')
insert into #test values (7,'c')
insert into #test values (8,'d')
select min(itemid),max(itemid),itemtype
from #test group by itemtype
What is differ between these queries?
"David Portas" <REMOVE_BEFORE_REPLYING_dportas@.acm.org> wrote in message
news:1123000386.349876.22890@.g47g2000cwa.googlegroups.com...
> I'm not sure what you mean by "gaps are acceptable as long as they are
> not too large". Apparently no gaps in your sample data anyway. See if
> this meets your requirements:
> SELECT MIN(itemid), MAX(itemid), itemtype
> FROM
> (SELECT T1.itemid, T1.itemtype,
> MIN(T2.itemid) AS x_itemid
> FROM tbl AS T1
> LEFT JOIN tbl AS T2
> ON T1.itemtype <> T2.itemtype
> AND T1.itemid < T2.itemid
> GROUP BY T1.itemid, T1.itemtype) AS T
> GROUP BY itemtype, x_itemid
> If performance is an issue then you could do this for smaller subsets
> of rows and then combine the results.
> --
> David Portas
> SQL Server MVP
> --
>|||Hi Uri,
Replace
insert into #test values (8,'d')
with
insert into #test values (8,'a')
and see the difference
With warm regards
Jatinder Singh|||So David's script gave me a wrong output.
"jsfromynr" <jatinder.singh@.clovertechnologies.com> wrote in message
news:1123050090.479481.14310@.z14g2000cwz.googlegroups.com...
> Hi Uri,
> Replace
> insert into #test values (8,'d')
> with
> insert into #test values (8,'a')
> and see the difference
> With warm regards
> Jatinder Singh
>|||Hi Uri,
It gave correct ouput to me.
Start End ItemType
8 9 A -- (1)
1 3 A -- (2)
4 5 B
6 7 C
The only thing is (1) appears at top which can be easily adjusted by
using ored by clause
With warm regards
Jatinder Singh|||Hi
Should not be 1 for MIN and 9 for MAX for A?
"jsfromynr" <jatinder.singh@.clovertechnologies.com> wrote in message
news:1123056485.365685.300130@.g14g2000cwa.googlegroups.com...
> Hi Uri,
> It gave correct ouput to me.
> Start End ItemType
> 8 9 A -- (1)
> 1 3 A -- (2)
> 4 5 B
> 6 7 C
> The only thing is (1) appears at top which can be easily adjusted by
> using ored by clause
> With warm regards
> Jatinder Singh
>|||Hi Uri,
Again Let us see this
ItemID ItemType
1 A -- *
2 A -- * One Group with ItemType='a' Here
min(itemid)= 1 and max is 3
3 A -- * 1 3 A (One Row of Reuired Result)
4 B -- ^ Another Group with ItemType='b' Here
min(itemid)= 4 and max is 5
5 B -- ^ 4 5 B (Another Row of Reuired Result)
6 C -- ~Another Group with ItemType='c' Here
min(itemid)= 6 and max is 7 7 C -- 6 7 C
(Another Row of Reuired Result)
8 A -- Again 'A' is repeated but there is gap so it
is to be considerd as a
9 A -- Fresh Group
-- 8 9 A
So the resultant output produced by David's Query is Correct
Start End ItemType
1 3 A
4 5 B
6 7 C
8 9 A
I hope it made the author's requirements more clear.
With warm regards
Jatinder Singh|||The difference is that your query only gives one row per ItemType
rather than one row per contiguous sequence on ItemType. I call my
query a "condensed" or "stepped" sequence rather than an aggregation.
The point is that it shows the regions or periods over which a
particuar ItemType applies. In my interpretation that's what BTJ was
asking for.
David Portas
SQL Server MVP
--|||David
Thanks, I got it
"David Portas" <REMOVE_BEFORE_REPLYING_dportas@.acm.org> wrote in message
news:1123058956.842173.89690@.o13g2000cwo.googlegroups.com...
> The difference is that your query only gives one row per ItemType
> rather than one row per contiguous sequence on ItemType. I call my
> query a "condensed" or "stepped" sequence rather than an aggregation.
> The point is that it shows the regions or periods over which a
> particuar ItemType applies. In my interpretation that's what BTJ was
> asking for.
> --
> David Portas
> SQL Server MVP
> --
>

No comments:

Post a Comment