Forums - CD PROJEKT RED
Building a gaming PC

Page 23 of 154

GuyNwah

Ex-moderator
#441
Jan 22, 2015
.Volsung. said:
which of the two? different frames or different parts of a frame?

Maybe I'm judging based on my lack of graphics knowledge, but in the first case I don't see any data dependencies. Each frame comes with its own dataset and whatever other specifications, and it should be possible to render each separately, without the entire set of resources. If additional information is needed (border issues, as in the previous and next frame), then an appropriate distribution scheme is splitting into overlapping segments of n + a frames for each processing unit, and processing only the n frames. If each frame is split for rendering (which leads to additional communication costs for putting together the final result), then use the same scheme mentioned above for data partitioning, except within a single frame (e.g. for a large matrix, row splitting with overlap if neighbors are necessary).

I don't know what the input and output representations are and how much bandwidth they need, but I'd think PCI Express is sufficiently fast to do this while keeping the load balancing overhead low.

This is a simple approach used everyday in parallel and high performance computing, for instance applicable to solve large linear systems using (non-contiguous) distributed memory computer architectures (a computing cluster for instance) and LARGE amounts of RAM (more than 32 GB).

I suppose the reasons are pragmatic. Keeping copies of everything is much simpler.
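The overlapping-segment ("n + a") scheme in the quote above can be sketched in plain Python; the chunk size and halo width here are arbitrary illustrative choices:

```python
# Sketch of the "n + a" partitioning from the quote: each worker receives
# n frames plus a halo frames on either side, but only produces output
# for the n frames it owns.
def split_with_overlap(frames, n, a):
    """Yield (chunk, start, end): chunk includes up to `a` halo frames on
    each side; [start, end) marks the frames the worker is responsible for."""
    chunks = []
    for i in range(0, len(frames), n):
        lo = max(0, i - a)
        hi = min(len(frames), i + n + a)
        chunks.append((frames[lo:hi], i, min(len(frames), i + n)))
    return chunks

# 10 frames, 4 per worker, 1 halo frame on each side
parts = split_with_overlap(list(range(10)), n=4, a=1)
# parts[0] == ([0, 1, 2, 3, 4], 0, 4): sees one halo frame, owns frames 0-3
```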
Consecutive frames in AFR, stripes in SFR, rectangles in checkerboard. In action, consecutive frames will not differ by much in content. In fact, they will have almost the exact same datasets.

The difference between reading the resources needed to render the frame from local VRAM vs. involving the output processors (which are already heavily loaded), PCI-e interface, memory controller, and system memory is enough to render the latter impractical.

Unified memory architectures such as we are getting on AMD APUs and sort of getting on Maxwell may change this picture. But the older generation cards have to render frames from resources already in VRAM.
 
Last edited: Jan 22, 2015

volsung

Forum veteran
#442
Jan 22, 2015
Guy N'wah said:
Consecutive frames in AFR, stripes in SFR, rectangles in checkerboard. In action, consecutive frames will not differ by much in content. In fact, they will have almost the exact same datasets.

The difference between reading the resources needed to render the frame from local VRAM vs. involving the output processors (which are already heavily loaded), PCI-e interface, memory controller, and system memory is enough to render the latter impractical.

Unified memory architectures such as we are getting on AMD APUs and sort of getting on Maxwell may change this picture. But the older generation cards have to render frames from resources already in VRAM.
Sorry for insisting but I am really curious. From a parallel computing perspective I don't see why SLI works this way, but I honestly don't know much (i.e. anything) about game graphics.

Assuming I had CUDA devices 1 and 2, and one large dataset A (e.g. a large matrix) on which I have to perform SIMD parallel code f: instead of copying A onto both 1 and 2 (two memcpy calls) and parameterizing f with respect to each device (as a function of device number, thread id and positional information), it would be possible to copy one half of A onto each CUDA device (still two memcpy calls) and parameterize f differently (minus the device number). In the end, this would allow me to use matrices twice as large with the same number of memory operations. One device doesn't have to read the other device's memory. The matrices could be partitioned into halves determined by rows (stripes), submatrices (rectangles), or whole matrices if my dataset consists of more than one (frames).
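The split-vs-duplicate idea can be sketched with NumPy arrays standing in for per-device VRAM (no real CUDA calls here; the kernel f, the array shapes, and the row-stripe layout are all illustrative):

```python
import numpy as np

# NumPy arrays stand in for per-device VRAM; no real CUDA involved.
A = np.arange(16, dtype=np.float64).reshape(8, 2)

def f(chunk):
    # some SIMD-friendly elementwise kernel (illustrative)
    return chunk * 2.0 + 1.0

# "Duplicate" scheme: both devices hold all of A (2x total memory).
dev1_full, dev2_full = A.copy(), A.copy()

# "Split" scheme: each device holds one row stripe (1x total memory).
half = A.shape[0] // 2
dev1_half, dev2_half = A[:half].copy(), A[half:].copy()

# Each device runs f on its own stripe; concatenating the partial results
# reproduces f over the whole dataset with no inter-device reads.
result = np.concatenate([f(dev1_half), f(dev2_half)])
assert np.array_equal(result, f(A))
```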

Once things are copied into the device's memory, everything would be read from local VRAM; the rest of the memory hierarchy (such as the buses and main memory) would be involved in the same way as if both CUDA devices had copies of the entire dataset. This type of approach is applied successfully in distributed-memory computing clusters all over the world, using message passing through high-performance dedicated networks (usually based on InfiniBand or derivatives) to split data across different nodes, where fine-grained multithreaded code can run on multiple cores or multiple GPUs. With the right design [complexity function of f, obtain the parallel overhead and determine the appropriate ratio of input size to processors with respect to a communications model (of a network or internal buses)], this should increase efficiency and therefore also speedup. Algebraic methods such as those involved in graphics (for which GPUs were designed) scale particularly well.

I suppose game graphics don't behave like this though, and they are probably peculiar in that small scale response time is such a sensitive requirement that they're willing to sacrifice efficiency for raw speed? I'm assuming there's also some kind of workflow I am ignoring completely (for instance involving textures and shading or whatever), preventing a "good" utilization of the device.

Edit:

Ultimately I wonder how much of this is the API's responsibility and how much is the game programmers'. It seems to me at the moment they are using the easy way out, not the most efficient one.

Edit 2:

I wonder if this is due to having only one card in charge of the output, with the extra card(s) feeding ready-to-output data to the main card's memory, creating the restriction of using just as much memory as the output card has. This theory might explain this bit a little (for game graphics) and why this makes no sense in the world of HPC.
 
Last edited: Jan 22, 2015

Zabanzo

Senior user
#443
Jan 22, 2015
.Volsung. said:
$150 is a wide gap between the 960 and the 970. Is there a similar gap in performance? I'm guessing if there is, an eventual 960 Ti will sit in between the two.
I really hope there will be a Ti version with a bit more power and 4 GB of VRAM. I read an article some time ago that talked about two more mid-range Nvidia GPUs, a 960 Ti and a 965 Ti.
 

GuyNwah

Ex-moderator
#444
Jan 22, 2015
There are two problems with distributed resources:

One, you don't have a way to predict "this resource will be used only in the lower half of the frame, or in even frames, etc." So there's no sensible partitioning of the resources that has some on one GPU and some on the other.

Two, you don't have a low-cost mechanism for sharing resources that are on one but not the other. With unified memory, you could have a PCI-e transfer between cards. But unified memory on enough cards to make a new optimized SLI or Crossfire is not here yet. Without unified memory, you have a high-overhead path that involves creating two transfers, using system RAM as the intermediary, and using some amount of CPU to supervise it.

And finally, RAM is cheap, stone-simple partitioning is cheap and works well on old cards, slinging resources between cards is expensive, and the time a card can render a frame or stripe or checkerboard just from the contents of its own memory is what limits performance.
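The cost gap between the two paths can be put in back-of-envelope numbers; the bandwidth figures below are illustrative ballpark values (roughly a 2014-era GDDR5 card vs. PCIe 3.0 x16), not measurements of any specific hardware:

```python
# Back-of-envelope comparison of reading a frame's working set from local
# VRAM vs. fetching it over PCIe via system RAM. All figures illustrative.
local_vram_gbps = 224.0   # ballpark GDDR5 bandwidth on a ~2014 card
pcie3_x16_gbps = 16.0     # ballpark practical PCIe 3.0 x16 throughput

resource_mb = 512.0       # hypothetical working set touched per frame
local_ms = resource_mb / 1024.0 / local_vram_gbps * 1000.0
remote_ms = resource_mb / 1024.0 / pcie3_x16_gbps * 1000.0
# Without unified memory the PCIe hop is paid twice
# (card -> system RAM -> card), plus CPU supervision on top:
remote2_ms = 2 * remote_ms

print(f"local VRAM: {local_ms:.2f} ms, over PCIe twice: {remote2_ms:.2f} ms")
```

Even with generous assumptions, the remote path is more than an order of magnitude slower per frame, which is the "impractical" part in a real-time budget of ~16 ms.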
 
Last edited: Jan 22, 2015
sidspyker

Ex-moderator
#445
Jan 23, 2015
While I had nothing better to do, I disabled 2 of my 6 cores (making it a quad) and raised the frequency to 3800 MHz and the voltage to 1.4 V, just in case. Prime95'ing now, hope it turns out to be stable enough. I've had a terrible OC experience over the years so far, but for some reason I continue to repeat my mistakes... hoping I'll learn :p
 
yayodeanno.831

Forum veteran
#446
Jan 23, 2015
I'll just leave this here... :)
 

volsung

Forum veteran
#447
Jan 23, 2015
Guy N'wah said:
There are two problems with distributed resources:

One, you don't have a way to predict "this resource will be used only in the lower half of the frame, or in even frames, etc." So there's no sensible partitioning of the resources that has some on one GPU and some on the other.

Two, you don't have a low-cost mechanism for sharing resources that are on one but not the other. With unified memory, you could have a PCI-e transfer between cards. But unified memory on enough cards to make a new optimized SLI or Crossfire is not here yet. Without unified memory, you have a high-overhead path that involves creating two transfers, using system RAM as the intermediary, and using some amount of CPU to supervise it.

And finally, RAM is cheap, stone-simple partitioning is cheap and works well on old cards, slinging resources between cards is expensive, and the time a card can render a frame or stripe or checkerboard just from the contents of its own memory is what limits performance.
I think the issue here is that real-time graphics rendering imposes certain restrictions on what would otherwise be normal distribution schemes used in HPC every day. Speaking in terms of pure GPGPU for numerical computation:

1) You don't have to predict where resources will be allocated, because you can do it either explicitly or parametrically. You know device i has chunk i, and you know how and where it belongs in the complete dataset (regardless of which physical device i is mapped to).

2) Resources do not have to be shared; that's why this is a distributed-memory approach. You copy the data each device requires and proceed to compute whatever has to be done with no inter-device communication. Most matrix- and vector-based solutions are linear systems and can be easily separated [i.e. f(yA) = yf(A) and f(A + B) = f(A) + f(B)]. However, like you say, perhaps unified GPU memory may allow some efficient inter-device memory operations.
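The linearity argument is easy to check numerically; a small sketch, with an illustrative linear operator M standing in for f and an arbitrary column split across two "devices":

```python
import numpy as np

# For a linear f, f(A + B) = f(A) + f(B) and f(yA) = yf(A), so applying f
# to independent chunks and reassembling equals applying f to the whole.
M = np.array([[2.0, 0.0],
              [1.0, 3.0]])      # illustrative linear operator

def f(X):
    return M @ X               # matrix product: linear, column-separable

A = np.arange(8, dtype=np.float64).reshape(2, 4)
left, right = A[:, :2], A[:, 2:]            # column split across "devices"
reassembled = np.hstack([f(left), f(right)])
assert np.allclose(reassembled, f(A))       # chunked == whole
```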

I think due to my lack of knowledge in real-time graphics rendering I didn't quite get what you meant, but I suppose that there must be some "slinging resources between cards" because ultimately one GPU must convert data to a video signal. In a forum somebody mentioned the second GPU must copy its results "in the same memory location" on GPU1, so not having copies would complicate this situation. It would require additional computation to put it all back together, and I suppose the current belief is that it is preferable to use resources for graphics only. I see this as more of an "enterprise solution that works for everybody" rather than trying to maximize efficiency, at the cost of more complicated programming and/or resource allocation.

In parallel computing, GPGPU devices are independent and usually there is no requirement of one card performing some kind of real-time output, so we can normally use all of their memories for different things or different parts of the same thing (in fact Tesla cards don't even have video outputs, they're just processing units). Same principle as cluster computing and data distribution over numerous independent address spaces.

Thanks for your input. I think I'm satisfied now :)
 
sidspyker

Ex-moderator
#448
Jan 23, 2015
yayodeanno said:
I'll just leave this here... :)
Yeah, I tried that today; it falls right after 3200 MB.
 

volsung

Forum veteran
#449
Jan 23, 2015
sidspyker said:
While I had nothing better to do, I disabled 2 of my 6 cores (making it a quad) and raised the frequency to 3800 MHz and the voltage to 1.4 V, just in case. Prime95'ing now, hope it turns out to be stable enough. I've had a terrible OC experience over the years so far, but for some reason I continue to repeat my mistakes... hoping I'll learn :p
I think we have the same CPU (Phenom II X6 1090T) and I managed to OC to 3.7 GHz with all 6 cores using 1.425 V, and it has been stable for years. Recently I tried increasing it to 3.8 GHz and 1.475 V and it didn't work quite as well. I think my motherboard has some voltage issues.

Also, I use a Thermaltake Contac 30 heatsink with one 120 mm fan and temperatures never go above 55 °C. I had no idea what I was doing, but this CPU seems to overclock easily.
 
eskiMoe

Mentor
#450
Jan 23, 2015
sidspyker said:
Yeah I tried that today, falls right after 3200MB
Mine falls after 3GB.

But in games it only starts to stutter if I go beyond 3.6 GB VRAM usage.

Very strange.
 
Kinley

Ex-moderator
#451
Jan 23, 2015
Yeah mine drops after 3 as well.
 

volsung

Forum veteran
#452
Jan 23, 2015
Kinley said:
Yeah mine drops after 3 as well.
It almost looks like you're playing TW3.
 
Kinley

Ex-moderator
#453
Jan 23, 2015
.Volsung. said:
It almost looks like you're playing TW3.
I wish. :(
 
eskiMoe

Mentor
#454
Jan 23, 2015
From ocnet forum:

From Nai's benchmark, given the allocation behavior and the different bandwidths seen on different GPUs once the benchmark's memory allocation reaches the 2816 MiB to 3500 MiB range, I can only assume this is caused by the way the SMM units are disabled.

Allow me to elaborate my assumption. As we know, there are four raster engines in the GTX 970 and GTX 980.
Each raster engine has four SMM units. The GTX 980 has all SMM units enabled for each raster engine, so there are 16 SMM units.

The GTX 970 is made by disabling 3 of the SMM units. What Nvidia refused to tell us is which of the raster engines have their SMM units disabled.
I found most reviewers simply modified the high-level architecture overview diagram of the GTX 980 by removing one SMM unit from each of three raster engines, leaving one raster engine with its four SMM units intact.

First scenario
What if the first (or the second, third, fourth) raster engine has 3 of its SMM units disabled, instead of the disabled units being spread evenly across the four raster engines?

Second scenario
Or, the first raster engine has two SMM units disabled and the second raster engine has one SMM unit disabled?

Oh, please do notice the memory controller diagram for each of the raster engines too. >.< If we follow the first scenario, that raster engine will definitely not be able to make full use of the memory controller bandwidth.



Each memory controller is 64-bit; 4 memory controllers in total = a 256-bit memory interface.
Assume there are 3 raster engines with one SMM disabled each, leaving 1 raster engine with all 4 SMM intact.
Mathematically:
16 SMM = 256 bit = 4096 MB
13 SMM = 208 bit = 3328 MB

208 bit = effective width after disabling SMMs, with 256 bit being the actual memory controller width.

It is a hardware problem: the GTX 970 is effectively a 208-bit card.
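The arithmetic in the quoted post checks out under its own proportional-scaling model (which is the poster's assumption, not a confirmed description of the hardware):

```python
# Check of the quoted post's arithmetic: scale memory width and capacity
# by the fraction of enabled SMM units. This proportional model is the
# post's assumption, not an official description of the GTX 970.
total_smm, disabled = 16, 3
full_bits, full_mb = 256, 4096

ratio = (total_smm - disabled) / total_smm   # 13/16 = 0.8125
eff_bits = full_bits * ratio
eff_mb = full_mb * ratio

print(eff_bits, eff_mb)  # → 208.0 3328.0
```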
 

tahirahmed

Rookie
#455
Jan 23, 2015
@eskimoe



Quoted from

http://www.overclock.net/t/1535502/gtx-970s-can-only-use-3-5gb-of-4gb-vram-issue/170
 
Gilrond-i-Virdan

Forum veteran
#456
Jan 23, 2015
So basically GTX970 can never fully use all 4 GB?
 

tahirahmed

Rookie
#457
Jan 23, 2015
Gilrond said:
So basically GTX970 can never fully use all 4 GB?
Yes, and more symptoms indicate it being a hardware issue.
 
eskiMoe

Mentor
#458
Jan 23, 2015
tahirahmed said:
@eskimoe



Quoted from

http://www.overclock.net/t/1535502/gtx-970s-can-only-use-3-5gb-of-4gb-vram-issue/170
I know. Ran the test myself and there's a huge drop after 3GB VRAM usage.

Thinking about contacting the store I bought these from and asking for a refund..
 

tahirahmed

Rookie
#459
Jan 23, 2015
eskimoe said:
I know. Ran the test myself and there's a huge drop after 3GB VRAM usage.

Thinking about contacting the store I bought these from and asking for a refund..
That would be wise, I think. Investing $700 in the GPU section alone and getting 3.5 GB of VRAM due to a fault really is a bummer. Though if you have some days left on your return policy, I say wait a bit, since Nvidia is investigating the issue, to see what they have to say about this.

Sadly, I was the one suggesting this card like hell to everyone looking for an upgrade....
 

volsung

Forum veteran
#460
Jan 23, 2015
Mine's on its way to MSI, RMA'd for a different issue. I hope they will take this into account and find a suitable solution.
 