
Friday, March 30, 2012

More on querying remote server

i'll just keep posting the problems, as my own archive.
SELECT TransactionNumber
FROM servertest.CMSArchiveTraining.dbo.Transactions_90
WHERE transactionGUID = '4956E240-8B6E-437E-B9B0-5E83FC25E3F0'
Issues the remote query:
SOURCE:(servertest),
QUERY:(
SELECT
Tbl1001."TransactionGUID" Col1003,
Tbl1001."TransactionNumber" Col1004
FROM "CMSArchiveTraining"."dbo"."Transactions_90" Tbl1001
)
And tries to bring back 8.5M rows - and takes minutes to return, when really
there is only one row.
Question #1: Why doesn't it include the WHERE clause in the remote query?
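(An aside on methodology: the SOURCE/QUERY text shown in these examples can be captured locally from the plan's Remote Query operator; a sketch, using the same query:)

```sql
-- The estimated plan's Remote Query operator carries the exact
-- text that will be sent to the linked server.
SET SHOWPLAN_TEXT ON;
GO
SELECT TransactionNumber
FROM servertest.CMSArchiveTraining.dbo.Transactions_90
WHERE transactionGUID = '4956E240-8B6E-437E-B9B0-5E83FC25E3F0';
GO
SET SHOWPLAN_TEXT OFF;
GO
```

Running Profiler on the remote server should show the same statement arriving.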
If i change the query to:
SELECT TOP 1 TransactionNumber
FROM servertest.CMSArchiveTraining.dbo.Transactions_90
WHERE transactionGUID = '4956E240-8B6E-437E-B9B0-5E83FC25E3F0'
It still issues the remote query:
SOURCE:(servertest),
QUERY:(
SELECT
Tbl1001."TransactionGUID" Col1003,
Tbl1001."TransactionNumber" Col1004
FROM "CMSArchiveTraining"."dbo"."Transactions_90" Tbl1001
)
But this time it only brings over 10 rows (somehow), performs a top 1
filter, and returns instantly.
Question #2: How is it bringing over 10 rows only? (hint: cursor)
If i change the query to:
SELECT TransactionNumber
FROM servertest.CMSArchiveTraining.dbo.Transactions_90
WHERE TransactionNumber = 9679
It issues the remote query:
SOURCE:(servertest),
QUERY:(
SELECT
Tbl1001."TransactionNumber" Col1004
FROM "CMSArchiveTraining"."dbo"."Transactions_90" Tbl1001
WHERE Tbl1001."TransactionNumber"=(9679)
)
which returns only one row, and returns instantly. Obviously SQL Server can
*sometimes* do the optimization, other times it won't.
Question #3: Why does it include the WHERE clause in the remote query?
If i change the query to:
SELECT TransactionNumber
FROM servertest.CMSArchiveTraining.dbo.Transactions_90
WHERE TransactionDate = '2002-06-10 08:19:10.513'
It issues the remote query:
SOURCE:(servertest),
QUERY:(
SELECT
Tbl1001."TransactionNumber" Col1004
FROM "cmsarchivetraining"."dbo"."Transactions_90" Tbl1001
WHERE Tbl1001."TransactionDate"='2002-06-10T08:19:10.513'
)
Question #4: Why does it include the WHERE clause in the remote query?
Now i create a view:
CREATE VIEW CMSArchiveTransactions AS
SELECT *
FROM SERVERTEST.CMSArchiveTraining.dbo.Transactions_90
and issue the query:
SELECT TransactionNumber
FROM CMSArchiveTransactions
WHERE TransactionNumber = 9679
It wants to issue the remote query:
SOURCE:(SERVERTEST),
QUERY:(
SELECT
Tbl1001."TransactionNumber" Col1005
FROM "CMSArchiveTraining"."dbo"."Transactions_90" Tbl1001
)
And tries to bring back 8.5M rows - and takes minutes to return, when really
there is only one row.
Question #5: Why does it not include the WHERE clause in the remote query?
If i change the query to:
SELECT TOP 1 TransactionNumber
FROM CMSArchiveTransactions
WHERE TransactionNumber = 9679
it issues the remote query:
SOURCE:(SERVERTEST),
QUERY:(
SELECT TOP 1 Col1004
FROM (
SELECT
Tbl1001."TransactionNumber" Col1004
FROM "CMSArchiveTraining"."dbo"."Transactions_90" Tbl1001
WHERE Tbl1001."TransactionNumber"=(9679)
) Qry1018
)
Question #6: Why is it now including the TOP inside the remote query, when
my earlier TOP query didn't include the TOP limiter?
Question #7: Why is it now including the WHERE clause in the remote query,
when without the TOP 1 it wouldn't include the WHERE clause limiter?
Question #8: Why does it include the WHERE clause when i have any TOP
limiter (even TOP 999999999, or TOP 100 PERCENT), but not include the
WHERE clause when i don't?
So i try changing my query to:
SELECT TransactionNumber
FROM (
SELECT *
FROM SERVERTEST.CMSArchiveTraining.dbo.Transactions_90) t
WHERE TransactionNumber = 9679
And it issues the remote query:
SOURCE:(SERVERTEST),
QUERY:(
SELECT
Tbl1001."TransactionNumber" Col1004
FROM "CMSArchiveTraining"."dbo"."Transactions_90" Tbl1001
WHERE Tbl1001."TransactionNumber"=(9679)
)
So when querying through a view it will not include the WHERE clause, but if i
inline the same SELECT as a derived table, it can include the WHERE clause.
Question #9: Why does it include the WHERE clause when i query through a
derived table, and not through a view?
So i try changing the query to
SELECT TransactionNumber
FROM (
SELECT *
FROM CMSArchiveTransactions) t
WHERE TransactionNumber = 9679
So here we are, pay attention to this one. If i attempt to query the view
directly, it doesn't get optimized. Now i am going to query my view THROUGH
a derived table. What do you think it will do?
SOURCE:(SERVERTEST),
QUERY:(
SELECT Tbl1001."TransactionNumber" Col1004
FROM "CMSArchiveTraining"."dbo"."Transactions_90" Tbl1001
WHERE Tbl1001."TransactionNumber"=(9679)
)
It in fact now DOES include the where clause, but if i try to query my view
directly, it can't optimize it. This is a stunning development.
It means the rules are random, which makes it difficult to optimize
distributed queries.
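When the optimizer won't remote the filter on its own, one workaround is OPENQUERY, whose pass-through text is executed verbatim on the linked server (a sketch; note that OPENQUERY takes only a string literal, so a parameterized version needs dynamic SQL):

```sql
-- The inner text runs entirely on servertest; only the single
-- matching row crosses the link.
SELECT TransactionNumber
FROM OPENQUERY(servertest, '
    SELECT TransactionNumber
    FROM CMSArchiveTraining.dbo.Transactions_90
    WHERE TransactionGUID = ''4956E240-8B6E-437E-B9B0-5E83FC25E3F0''
');
```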
Rules for including the WHERE clause:
Filter on uniqueidentifier:
    Query includes TOP 1: Yes*
    Query includes TOP 2: No
    Query includes TOP n: No
    Query includes TOP n PERCENT: No
    Query doesn't include TOP: No
Filter on integer:
    Query linked SQL Server directly: Yes
    Query view that queries linked SQL Server:
        Query view directly:
            Query includes TOP 1: Yes
            Query includes TOP 2: Yes
            Query includes TOP 100: Yes
            Query includes TOP 999999999: Yes
            Query includes TOP 100 PERCENT: Yes
            Query doesn't include TOP: No
        Query view through derived table:
            Query includes TOP 1: Yes
            Query includes TOP 2: Yes
            Query includes TOP n: Yes
            Query includes TOP 100 PERCENT: Yes
            Query doesn't include TOP: Yes|||You haven't posted any DDL. Are there any indexes on the remote table? Which
columns are indexed?
ML
http://milambda.blogspot.com/|||"ML" <ML@.discussions.microsoft.com> wrote in message
news:E89D60ED-EAEF-46F9-9F31-1D56842B277E@.microsoft.com...
> You haven't posted any DDL. Are there any indexes on the remote table?
> Which
> columns are indexed?
Before answering that, let me ask you this: why does it matter? It doesn't
seem to matter *sometimes*. And even if there aren't any indexes on the
remote tables, everyone can agree that sending any filtering criteria to a
remote server is faster than trying to filter rows after they've crossed a
link.
Even a table scan on a remote server is better than bringing the rows over a
network to be filtered.
Additionally, my post is not dealing with query performance. When SQL Server
does what it should do, the queries run fine. If SQL Server doesn't do what
it's supposed to do, the queries do not run fine. Indexes are not the
performance limitation here.
That having been said: yes.|||If you think you know your data better than the optimizer does (as might be
the case - especially with outdated statistics), you could use the REMOTE
join hint to tell the optimizer to process the join on the remote server.
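(For reference, REMOTE is a join hint, not a filter hint; a sketch, assuming a hypothetical local table LocalTransactions to join against:)

```sql
-- REMOTE forces the join to be performed on the server owning the
-- right-hand table. It is valid only for an INNER JOIN.
SELECT l.TransactionID, r.TransactionNumber
FROM dbo.LocalTransactions AS l
INNER REMOTE JOIN servertest.CMSArchiveTraining.dbo.Transactions_90 AS r
    ON r.TransactionNumber = l.TransactionNumber;
```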
Basically, the optimizer can only rely on statistics. If they are missing on
the remote site, he prefers "staying local" although filtering data remotely
might have yielded better performance. Why put a stress on a remote server if
poor performance is expected anyway?
But when statistics are up to date on both sites, then the optimizer still
has full control over the execution of queries (unless specified otherwise
through the use of hints) which basically means that he can still make the
wrong assumptions - and I'm as much in the dark here as you are.
ML
http://milambda.blogspot.com/|||> Basically, the optimizer can only rely on statistics. If they are missing
> on
> the remote site, he prefers "staying local" although filtering data
> remotely
> might have yielded better performance. Why put a stress on a remote server
> if
> poor performance is expected anyway?
i guess this is where common sense gets to take a back seat.

> But when statistics are up to date on both sites, then the optimizer still
> has full control over the execution of queries (unless specified otherwise
> through the use of hints) which basically means that he can still make the
> wrong assumptions - and I'm as much in the dark here as you are.
Is there anyone who could explain the bewildering optimizer choices?|||Can anyone
ever imagine
any situation
under any circumstances
in any manner
in any capacity
in any way
with any product
running on any kind of data link
fast or slow
on any database setup
existing in any universe
moving at any velocity
at any time since the big bang itself
where it would be better to filter rows after fetching them?|||Here is some SQL to create a table named "Transactions", which will contain
some fields including "TransactionDate", "TransactionNumber", and
"TransactionGUID":
USE pubs
go
DROP TABLE Transactions
go
CREATE Table Transactions (
[TransactionNumber] [int] NOT NULL IDENTITY,
TransactionGUID uniqueidentifier NOT NULL default newid(),
[title] [varchar] (80) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[type] [char] (12) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[pub_id] [char] (4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[price] [money] NULL ,
[advance] [money] NULL ,
[royalty] [int] NULL ,
[ytd_sales] [int] NULL ,
[notes] [varchar] (200) NULL ,
[TransactionDate] [datetime] NOT NULL DEFAULT (getdate())
)
go
INSERT INTO Transactions (title, type, pub_id, price, advance, royalty,
ytd_sales, notes, TransactionDate)
SELECT
[title],
[type],
[pub_id],
[price],
[advance],
[royalty],
[ytd_sales],
[notes],
CAST(CAST([pubdate] as real) + RAND(3234)*2000 - 1000 AS datetime)
FROM Titles
CROSS JOIN (
SELECT (a.Number * 256) + b.Number AS Number
FROM master..spt_values a,
master..spt_values b
WHERE a.Type = 'p'
AND b.Type = 'p') numbers
You'll have to create the linked server yourself; you'll need two servers.
You'll also have to substitute your own values for TransactionGUID,
TransactionDate, and TransactionNumber.|||I would hazard a guess that the local SQL server has no knowledge of the
statistics on the remote server and can only determine which query to send
across to the remote server based on the existence of keys (possibly indexes
or unique indexes).
The question would be what indexes exist on the remote table. I believe if
you look at these closely your results will turn out to be much more
consistent than you think.
"ML" <ML@.discussions.microsoft.com> wrote in message
news:064A71B2-9B67-4F97-8635-6854A4B9551C@.microsoft.com...
> If you think you know your data better than the optimizer does (as might
be
> the case - especially with outdated statistics), you could use the REMOTE
> join hint to tell the optimizer to process the join on the remote server.
> Basically, the optimizer can only rely on statistics. If they are missing
on
> the remote site, he prefers "staying local" although filtering data
remotely
> might have yielded better performance. Why put a stress on a remote server
if
> poor performance is expected anyway?
> But when statistics are up to date on both sites, then the optimizer still
> has full control over the execution of queries (unless specified otherwise
> through the use of hints) which basically means that he can still make the
> wrong assumptions - and I'm as much in the dark here as you are.
>
> ML
> --
> http://milambda.blogspot.com/|||I am guessing to a fair extent here, but some possible (and I think likely)
explanations for this behavior...
The main thing is, I think, that SQL Server has no statistics available when
accessing a remote database and makes a best guess based on the existence of
keys.
Incidentally, when working with Oracle 8 I found similar issues with their
"database links".

>Question #1: Why doesn't it include the WHERE clause in the remote query?
No Index on transactionGUID?

>Question #2: How is it bringing over 10 rows only? (hint: cursor)
10 rows is the size of the chunks retrieved. The TOP 1 is in the first ten
rows, so SQL Server doesn't bother retrieving any further data from the
remote database.

>Question #3: Why does it include the WHERE clause in the remote query.
Index or PK on TransactionNumber?

>Question #4: Why does it include the WHERE clause in the remote query?
Index or PK on TransactionDate?

>Question #5: Why does it not include the WHERE clause in the remote query?
Entire view is retrieved. The optimizer will sometimes do this with local
views as well, depending on how they are used.

>Question #6: Why is it now including the TOP inside the remote query, when
>my earlier issue of a TOP query didn't include the TOP limiter?
>Question #7: Why is it now including the WHERE clause in the remote query,
>when not including a TOP 1 it won't include the where clause limiter?
>Question #8: Why does it include the WHERE clause when i have any TOP
>limiter (even TOP 999999999, or TOP 100 PERCENT), but does not include the
>where clause when i don't?
TOP says to return only X rows from the view. Because it needs to order the
results, a different execution plan is produced. This will happen with
local views also, depending on statistics and how they are used.

>Question #9: Why does it include the WHERE clause when i query through a
>derived table, and not through a view?
The local SQL server is rewriting your select to directly access the table.
This particular SQL may look different to a human, but logically, and to SQL
Server, it is identical to your 3rd example. This is perfectly consistent.
"Ian Boyd" <ian.msnews010@.avatopia.com> wrote in message
news:O$7N6IZWGHA.4424@.TK2MSFTNGP05.phx.gbl...
> [snip]|||Absolutely. The same way there are cases when a full table scan is better
than using an index.
If you have 1,000 rows in a table and you add a filter on an indexed field
that returns 999 records, then SQL Server will do a full table scan instead
of using the index.
If you have 1,000 rows in a remote table and you add a filter that returns
999 records, let the database that is actually going to use the data spend
the cycles filtering it. Let the remote server just do a quick IO and send
the data over the network with minimal CPU usage.
Now, if the local server has statistics on the tables in the remote
database, then it could make much better decisions regarding when to send
the filter and when not to. Because it doesn't know the remote database, it
has no idea how many rows will be returned (unless querying on a key field,
maybe?) and makes a (rather useless) judgement call.
"Ian Boyd" <ian.msnews010@.avatopia.com> wrote in message
news:%23fsEP7ZWGHA.2064@.TK2MSFTNGP03.phx.gbl...
> [snip]

Wednesday, March 28, 2012

More logical drives VS more striped drives

A user called ZoomZoom made the posting below on TechRepublic. As it did
not get a reply there, I am interested to know what the community's opinion
is.
Some of my Raid 10's are made up of 8 drives. Would it have been better to
have created 2 Raid 10's of 4 drives and then have two data files in the
file group?
I would expect an 8 drive Raid 10 to give a higher sequential data rate but
what about random IO when one has many users?
Regards,
Paul Cahill
...
Posted by ZoomZoom
I have seen many articles that say to place log files separate from db
files, and use raid 1+0 for the db... but I haven't found anything to answer
this question for me though (unless it should be so obvious that I'm just
not seeing it).
Is it faster for the database to have the tables split into separate files
and put each file in smaller raid 1+0 configurations or is it faster to use
the same number of drives in a single raid 1+0. For example: 8 drives could
be split into 2 raid 1+0 arrays (4 each) or they could be configured in one
large raid 1+0 array (8 drives). The end result would be striping between 2
sets of 2 drives (in the 4 disk array due to 2 being only mirrors) or
striping between 4 drives (in the 8 disk array).
I guess another way of looking at the question is how much performance does
each additional pair of disks in a raid 1+0 array really add. Can an 8 disk
array handle twice as many IOs as a 4 disk array? Or does it only add
maybe 30% more IO ability? If the latter, would it make sense to add blocks
of 4 disk arrays and split the database tables into multiple files instead?
Would this theory work like putting the logs on separate drives from the
database files?
I realize it's an expensive solution... but with the price of database
per-processor licenses being what they are... it makes sense to try to
squeeze as much performance out of your database server as possible.|||I have thought about it many times.
When using more disks in one volume, random IO rises, but not by the same
factor as the number of disks.
If the load on these two files is the same, then two volumes are better, but
if it is not, one volume is better.
Are you sure the load is always uniformly distributed between the files? Then
use separate volumes.
In the real world the load always fluctuates, so it is not possible to draw a
general conclusion.
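A back-of-envelope sketch of the scaling question (the 150 random IOPS per spindle is an assumed round number, and ideal load balancing is assumed):

```sql
DECLARE @iops_per_disk int
SET @iops_per_disk = 150  -- assumed random IOPS per spindle

-- RAID 10: every spindle can serve random reads; each random write
-- costs one physical write on each side of a mirror pair.
SELECT disks,
       disks * @iops_per_disk       AS est_read_iops,
       (disks / 2) * @iops_per_disk AS est_write_iops
FROM (SELECT 4 AS disks UNION ALL SELECT 8) AS d
```

So in the ideal case an 8-disk array really does handle twice the random IO of a 4-disk array; controller overhead and skewed load put real systems below that.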
"Paul Cahill" <anon@.anon.com> wrote in message
news:%237m9RqZuHHA.3368@.TK2MSFTNGP02.phx.gbl...
> [snip]|||I pretty much agree with that explanation. If you know exactly how your
files will be accessed and they compete with each other at a fairly
intensive level you might get better performance from splitting them. This
is true of separating tempdb from data files as well. But if you don't know
or the load is not too high you are most likely better off having a larger
array with all the data files on it. Now placing the log files on another
array is always a good idea.
--
Andrew J. Kelly SQL MVP
"Jirí Lejsek" <jlejsek@.na_volnym_v_cesku> wrote in message
news:urgxfhauHHA.1184@.TK2MSFTNGP04.phx.gbl...
> [snip]|||On Thu, 28 Jun 2007 17:02:37 -0400, "Andrew J. Kelly"
<sqlmvpnooospam@.shadhawk.com> wrote:
> [snip]
The problem I have with targeting files at particular drives is that
it tunes for one particular load only. Tune it for the daytime OLTP
and it could be sub-optimal during heavy duty batch processing at
night. One big striped set spreads the load pretty evenly across all
spindles, regardless of load, and regardless of how the load changes
over time. It might not be as fast in some special instances, but I
think on average one big set of drives is going to perform as well or
better. Targetting files also means one more thing for the DBA to
keep an eye on and worry about changing later when the load changes or
space grows differently than expected.
Logs, of course, are another story as everyone knows.
Roy Harvey
Beacon Falls, CT|||Right that is why I qualified it with knowing "Exactly" how they are used
which is hard to do.
--
Andrew J. Kelly SQL MVP
"Roy Harvey" <roy_harvey@.snet.net> wrote in message
news:v798831lqurvqepcvtb9q294c4ha95liq9@.4ax.com...
> [snip]

More logical drives VS more striped drives

A user called ZoomZoom made the posting below on Tech Republic. As it did
not get a reply there I am interested to know what the community's opinion
is.
Some of my Raid 10's are made up of 8 drives. Would it have been better to
have created 2 Raid 10's of 4 drives and then have two data files in the
file group?
I would expect an 8 drive Raid 10 to give a higher sequential data rate but
what about random IO when one has many users?
Regard
Paul Cahill
...
Posted by ZoomZoom
I have seen many articles that say to place log files separate from db
files, and use raid 1+0 for the db... but I haven't found anything to answer
this question for me though (unless it should be so obvious that I'm just
not seeing it).
Is it faster for the database to have the tables split into separate files
and put each file in smaller raid 1+0 configurations or is it faster to use
the same number of drives in a single raid 1+0. For example: 8 drives could
be split into 2 raid 1+0 arrays (4 each) or they could be configured in one
large raid 1+0 array (8 drives). The end result would be striping between 2
sets of 2 drives (in the 4 disk array due to 2 being only mirrors) or
striping between 4 drives (in the 8 disk array).
I guess another way of looking at the question is how much performance does
each additional pair of disks in a raid 1+0 array really add. Can an 8 disk
array handle twice as many IO's as a 4 disk array? Or does it only add
maybe 30% more IO ability? If the latter, would it make sense to add blocks
of 4 disk arrays and split the database tables into multiple files instead?
Would this theory work like putting the logs on separate drives from the
database files?
I realize it's an expensive solution... but with the price of database
per-processor licenses being what they are... it makes sense to try to
squeeze as much performance out of your database server as possible.|||I have thought about it many times.
When using more disks in one volume, random IO rises, but not by the same
factor as the number of disks.
If the load on these two files is the same, then two volumes are better, but if
it is not, one volume is better.
Are you sure that the load is always uniformly distributed between the files? Then
use separate volumes.
In the real world the load always fluctuates, so it is not possible to draw a
general conclusion.
"Paul Cahill" <anon@.anon.com> pse v diskusnm prspevku
news:%237m9RqZuHHA.3368@.TK2MSFTNGP02.phx.gbl...
>A user called ZoomZoom made the posting below on Tech Republic. As it did
>not get a reply there I am interested to know what the community's opinion
>is.
> Some of my Raid 10's are made up of 8 drives. Would it have been better to
> have created 2 Raid 10's of 4 drives and then have two data files in the
> file group?
> I would expect an 8 drive Raid 10 to give a higher sequential data rate
> but what about random IO when one has many users?
> Regards
> Paul Cahill
> ...
> Posted by ZoomZoom
>
> I have seen many articles that say to place log files separate from db
> files, and use raid 1+0 for the db... but I haven't found anything to
> answer this question for me though (unless it should be so obvious that
> I'm just not seeing it).
> Is it faster for the database to have the tables split into separate files
> and put each file in smaller raid 1+0 configurations or is it faster to
> use the same number of drives in a single raid 1+0. For example: 8 drives
> could be split into 2 raid 1+0 arrays (4 each) or they could be configured
> in one large raid 1+0 array (8 drives). The end result would be striping
> between 2 sets of 2 drives (in the 4 disk array due to 2 being only
> mirrors) or striping between 4 drives (in the 8 disk array).
> I guess another way of looking at the question is how much performance
> does each additional pair of disks in a raid 1+0 array really add. Can an
> 8 disk array handle twice as many IO's as a 4 disk array? Or does it
> only add maybe 30% more IO ability? If the latter, would it make sense to
> add blocks of 4 disk arrays and split the database tables into multiple
> files instead? Would this theory work like putting the logs on separate
> drives from the database files?
> I realize it's an expensive solution... but with the price of database
> per-processor licenses being what they are... it makes sense to try to
> squeeze as much performance out of your database server as possible.
>|||I pretty much agree with that explanation. If you know exactly how your
files will be accessed and they compete with each other at a fairly
intensive level you might get better performance from splitting them. This
is true of separating tempdb from data files as well. But if you don't know
or the load is not too high you are most likely better off having a larger
array with all the data files on it. Now placing the log files on another
array is always a good idea.
Andrew J. Kelly SQL MVP
"Jir Lejsek" <jlejsek@.na_volnym_v_cesku> wrote in message
news:urgxfhauHHA.1184@.TK2MSFTNGP04.phx.gbl...
>I have thought about it many times.
> When using more disks in one volume, random IO rises, but not by the same
> factor as the number of disks.
> If the load on these two files is the same, then two volumes are better, but
> if it is not, one volume is better.
> Are you sure that the load is always uniformly distributed between the files? Then
> use separate volumes.
> In the real world the load always fluctuates, so it is not possible to draw a
> general conclusion.
> "Paul Cahill" <anon@.anon.com> wrote in message
> news:%237m9RqZuHHA.3368@.TK2MSFTNGP02.phx.gbl...
>|||On Thu, 28 Jun 2007 17:02:37 -0400, "Andrew J. Kelly"
<sqlmvpnooospam@.shadhawk.com> wrote:

>I pretty much agree with that explanation. If you know exactly how your
>files will be accessed and they compete with each other at a fairly
>intensive level you might get better performance from splitting them. This
>is true of separating tempdb from data files as well. But if you don't know
>or the load is not too high you are most likely better off having a larger
>array with all the data files on it. Now placing the log files on another
>array is always a good idea.
The problem I have with targeting files at particular drives is that
it tunes for one particular load only. Tune it for the daytime OLTP
and it could be sub-optimal during heavy duty batch processing at
night. One big striped set spreads the load pretty evenly across all
spindles, regardless of load, and regardless of how the load changes
over time. It might not be as fast in some special instances, but I
think on average one big set of drives is going to perform as well or
better. Targeting files also means one more thing for the DBA to
keep an eye on and worry about changing later when the load changes or
space grows differently than expected.
Logs, of course, are another story as everyone knows.
Roy Harvey
Beacon Falls, CT|||Right that is why I qualified it with knowing "Exactly" how they are used
which is hard to do.
Andrew J. Kelly SQL MVP
"Roy Harvey" <roy_harvey@.snet.net> wrote in message
news:v798831lqurvqepcvtb9q294c4ha95liq9@.4ax.com...
> On Thu, 28 Jun 2007 17:02:37 -0400, "Andrew J. Kelly"
> <sqlmvpnooospam@.shadhawk.com> wrote:
>
> The problem I have with targetting files at particular drives is that
> it tunes for one particular load only. Tune it for the daytime OLTP
> and it could be sub-optimal during heavy duty batch processing at
> night. One big striped set spreads the load pretty evenly across all
> spindles, regardless of load, and regardless of how the load changes
> over time. It might not be as fast in some special instances, but I
> think on average one big set of drives is going to perform as well or
> better. Targetting files also means one more thing for the DBA to
> keep an eye on and worry about changing later when the load changes or
> space grows differently than expected.
> Logs, of course, are another story as everyone knows.
> Roy Harvey
> Beacon Falls, CT
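For what it's worth, ZoomZoom's original question ("can an 8 disk array handle twice as many IO's as a 4 disk array?") can be approximated with the standard RAID 1+0 write-penalty arithmetic. A rough sketch, assuming identical spindles, an ideal controller, and a hypothetical 150 IOPS per disk:

```python
def raid10_frontend_iops(n_disks, disk_iops=150, read_fraction=0.7):
    """Rough frontend random-IOPS estimate for a RAID 1+0 array.

    Reads can be served by any spindle, while each logical write costs
    two backend IOs (one per mirror half) -- the RAID 1+0 write penalty.
    """
    backend_capacity = n_disks * disk_iops
    write_fraction = 1.0 - read_fraction
    return backend_capacity / (read_fraction + 2 * write_fraction)

# Under this model an 8-disk array handles exactly twice the random IO
# of a 4-disk array, because doubling spindles doubles backend capacity.
print(round(raid10_frontend_iops(4)))  # ~462 IOPS
print(round(raid10_frontend_iops(8)))  # ~923 IOPS
```

Under this simple model random IO scales linearly with spindle count, so the 8-drive array does roughly double what the 4-drive array does; real controllers, caches and queue depths will pull the numbers away from this ideal, which is where the "only 30% more" outcome can show up in practice.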

Monday, March 26, 2012

More connect feedback required please

[Microsoft follow-up]

I submitted a posting to connect here: https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=287213

Here is what I wrote:

A number of tasks output data to variables or text files. e.g. WMI Data Reader Task, Execute SQL Task, Web Service Task.

Consuming that output in a data-flow isn't particularly easy. You either need a Flat File source adapter (which requires an otherwise superfluous connection manager) or write some code in a script component to parse the recordset. There is simply no easy way to push data from these sources into the dataflow.

The built-in mechanism for passing data between different tasks is raw files. Currently they can only be used by the dataflow but I see no reason why they couldn't be used by other tasks as well. It makes complete sense to me for a WMI Data Reader Task to push some data into a raw file and then we chew that data up in a dataflow.

The following response came back

Our current architecture actually doesn't have the buffer system as in Data Flow, when you are in the Control Flow. What you are asking would require us to build a similar buffer system in the Control Flow, which is a fundamental architectural change. We'll not be able to take this, sorry.

I'm afraid I don't understand that response. Obviously I know that buffers are only in the data-flow - but I don't see why that's relevant. Raw files are just files on the file system, same as any other. OK, their format is very proprietary but it's you guys that built the format. Essentially all I'm asking you to do is output the data in raw file format as opposed to flat file format. There's no notion of buffers in a raw file because it's just a lump of data. Or is there? If not, I'm afraid I don't understand your answer.

Please could you clarify the answer?

-Jamie

Hi,

Are the [Microsoft follow-up] tags not working anymore? Smile

-Jamie

|||

They do work. Sometimes, it is not easy to find a brave soul to catch a hot potato. Smile

I was waiting for somebody from the group that triaged this item to respond, but let me present my view:

The raw files are basically data flow buffers streamed into files. More or less, it is a collection of "raw" data flow buffers copied byte by byte. It would be possible to mimic this format from other tasks but it would not be a trivial task. The data other tasks are dealing with is structured differently and would probably need to go through chains of adaptations/conversions to make it conform to the metadata parameters understood by the data flow.

I understand your motivation behind this request and it is more than sensible. I just wish we had a standardized metadata system across runtime and pipeline that would allow us to easily implement this.

Thanks,

Bob

|||

Bob Bojanic - MSFT wrote:

They do work. Sometimes, it is not easy to find a brave soul to catch a hot potato.

Fair enough. Credit to you then Bob for picking it up!

Bob Bojanic - MSFT wrote:

I was waiting for somebody from the group that triaged this item to respond, but let me present my view:

The raw files are basically data flow buffers streamed into files. More or less, it is a collection of "raw" data flow buffers copied byte by byte. It would be possible to mimic this format from other tasks but it would not be a trivial task. The data other tasks are dealing with is structured differently and would probably need to go through chains of adaptations/conversions to make it conform to the metadata parameters understood by the data flow.

I understand your motivation behind this request and it is more than sensible. I just wish we had a standardized metadata system across runtime and pipeline that would allow us to easily implement this.

Thanks,

Bob

OK, that kinda makes sense. Here's another even hotter potato for you then. Why DON'T you have "a standardized metadata system across runtime and pipeline"? It has always seemed strange to me that (e.g.) variable data types are different to pipeline data types.

-Jamie

|||

Jamie Thomson wrote:


OK, that kinda makes sense. Here's another even hotter potato for you then. Why DON'T you have "a standardized metadata system across runtime and pipeline ". It has always seemed strange to me that (e.g.) variable data types are different to pipeline data types.

-Jamie

HAHAHA! You know, that's so true. If, in your example, we had consistent data types between variables, pipelines, parameter mappings, etc... consumer adoption of SSIS would be much greater and our jobs on this forum would be made SO much easier.

|||

Jamie Thomson wrote:

Here's another even hotter potato for you then. Why DON'T you have "a standardized metadata system across runtime and pipeline ". It has always seemed strange to me that (e.g.) variable data types are different to pipeline data types.

Hmm. I can try to answer it but have to watch not to burn my fingers too much. Smile

IMHO, and I can be off since these designs/implementations date from before my time with this group, we focused so much on breaking DTS into two functional modules (runtime and pipeline) that we failed to realize some obvious synergies they have. We started building two separate modules from the start and divided our resources around them. That resulted in separate designs, implementations and overall look and feel. We even had two hamsters, at some point, named "Runtime" and "Pipeline". Smile

Again, this is only my view and other people on the team might have different insights. I wish we had done a better job on this, and I believe we should think seriously about making it better. It is not likely we will be able to improve much of it in Katmai, though.

Also, to be fair, coming up with a standardized data type system that could be used for many purposes is quite a complex task given the ambiguity of implementations in existing data sources and the apparent lack of standardization. Even a seemingly simple tool like the Import/Export wizard, which is supposed to move data between all sorts of heterogeneous data sources, faces loads of problems in its quest to make the transfers "simply work". In a perfect world, we would have standardized implementations of data type systems, provided by a platform (OS), that all other layers (DBs, CLR, tools and services) can plug into, and we would not need to worry about these things.

Anyway, I hope this sheds some light…

Thanks,

Bob

|||

Wow. That's a very candid response Bob. Thank you very much, I really appreciate the honesty.

As always, if we in the community can help in any way to shape what happens here in the future then come and knock on our door.


-Jamie

|||

Thank you, Jamie, for keeping us honest.

The community is here to keep us both inspired and humble at the same time. I like it for that.

|||

Now here's something I keep thinking about

SSIS trying to be all things to all people. Why? (The telephone toaster springs to mind at this point.)

Oracle and SQL are both now excellent products. Chances are your typical Oracle system will appear to be better because its design will be of higher quality. This is only because the barriers to entry are so high, unlike MS where any muppet can come along and design some crappy Access database, migrate using a wizard, and then think they are a DBA. This is known and historic, and as time goes by the overall design quality of SQL databases will meet that of Oracle.

This is also true because most SQL installations originally came about because people didn't or couldn't or wouldn't spend the money on hiring Oracle and all their costly consultants to come and set them up a db. But also they wouldn't spend the money on training people how to use SQL properly either.

This means that any company using Oracle or SQL is likely to stay doing so, and any new companies are likely to use SQL. It's "cheaper". MS cleverly understand that the extra costs of getting MS stuff going are not measurable on the balance sheet Smile

So assuming this logic, I would bet money that the number of people using SSIS in a non SQL environment is on a par with the number of people using obscure (non IE/FF) browsers.

If you had a website, you wouldn't even bother spending the money trying to make it work with the obscure browser. It is not worth the pain.

So assume that no CIO in their right mind is going to sanction switching a (probably working very nicely, thank you) Oracle ETL tool to SSIS. Why do MS persist with this strategy of trying to make SSIS all things to all people?

Just make all of the SSIS datatypes the same as SQL! That's where the data is going to largely end up anyway.

I would bet that 99% of SQL connection managers are the following:

sql connection; flat file/csv connection; excel connection.

For the 1% (and please someone from MS correct me on these figures - you do after all capture 'usage' information) of other datasources, why not have dataconverters, to just convert them all to sql datatypes?

|||Personally, I have done a fair percentage of my work in SSIS against non-SQL Server databases, so I really like their strategy. If SSIS didn't work with other databases, it would be fairly useless as an enterprise ETL tool.|||

jwelch wrote:

Personally, I have done a fair percentage of my work in SSIS against non-SQL Server databases, so I really like their strategy. If SSIS didn't work with other databases, it would be fairly useless as an enterprise ETL tool.

Couldn't agree more. Many of the SSIS packages I have built don't go near SQL Server except for logging purposes.

-Jamie

|||

I'm just basing it on my experience of the FTSE listed companies I have worked/consulted at since 1995.

Out of 10, only 1 was Oracle, the rest were SQL and only SQL.

I also don't remember attending any interviews at places where they were also using Oracle etc.

For every multinational giant, there are hundreds of SMEs all running SQL.

--

So if you have another DB and SQL, you are still having to convert to SSIS types along the way.

At least if they used SQL types, you would only have to convert once.

|||

We do not live inside SQL Server, and simply reusing the SQL types is not going to help us much under the hood. There is still an additional layer between us and the SQL engine, and conversions are not easy to avoid. It would possibly make the exposed semantics simpler, but only if we were able to consolidate the data type system across the entire product.

The devil lies in the details -- making these conversions transparent and seamless is quite a delicate business, and its complexity jumps exponentially as we increase the number of considered data sources.

Thanks,

Bob
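Bob's point about conversion complexity can be made concrete with a little counting (a toy model, not a description of SSIS internals): direct pairwise converters between n type systems grow quadratically, whereas a single canonical type system, which is the role the DT_* pipeline types play, keeps the converter count linear:

```python
def pairwise_converters(n_sources):
    # One converter per ordered pair of distinct type systems.
    return n_sources * (n_sources - 1)

def hub_converters(n_sources):
    # One converter to and one from a canonical (DT_*-style) type system.
    return 2 * n_sources

# The gap widens quickly as more data sources are considered.
for n in (3, 6, 12):
    print(n, pairwise_converters(n), hub_converters(n))
```

At three sources the two approaches cost the same; at twelve, pairwise converters already need 132 mappings against the hub's 24, which is one way to read why SSIS keeps its own intermediate type system rather than adopting any one database's types.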

|||

Product suggestion:

Allow the package to have an optional mode.

"CommonDataTypeMode"

Which you can set to SQL Server, Oracle etc

By setting this it will automatically default all of your datatypes to the chosen mode, e.g. any datetime will default to DT_DBTIMESTAMP

I am wasting so much time having to go in and 'correct'* data type information.

What is gained by having it:

Less time spent typing into minuscule textboxes.

Fewer errors due to always having the correct type.

*I have also raised a bug on the connect site regarding SSIS changing your datatypes when you change an expression in the derived column editor. I haven't heard anything back and don't expect to.
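For context, the defaulting this suggestion asks for already has a documented basis: SQL Server column types map to SSIS pipeline types in a fixed way, per the SSIS data type reference. A partial sketch of that mapping (the selection of types here is mine; treat omissions as such):

```python
# Partial map of SQL Server column types to SSIS pipeline (DT_*) types,
# as documented in the SSIS data type reference. A "CommonDataTypeMode"
# set to SQL Server could default new columns from a table like this.
SQL_TO_SSIS = {
    "bit":              "DT_BOOL",
    "tinyint":          "DT_UI1",
    "smallint":         "DT_I2",
    "int":              "DT_I4",
    "bigint":           "DT_I8",
    "real":             "DT_R4",
    "float":            "DT_R8",
    "decimal":          "DT_NUMERIC",
    "money":            "DT_CY",
    "datetime":         "DT_DBTIMESTAMP",
    "uniqueidentifier": "DT_GUID",
    "varchar":          "DT_STR",
    "nvarchar":         "DT_WSTR",
    "varbinary":        "DT_BYTES",
}

print(SQL_TO_SSIS["datetime"])  # DT_DBTIMESTAMP, as in the suggestion above
```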
